SAMtoBAM / MUMandCo

MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (only v4) to detect Structural Variation
GNU General Public License v3.0
64 stars 14 forks source link

Size of deletions #5

Open linusdsv opened 4 years ago

linusdsv commented 4 years ago

Hello,

I have been using Mum&Co to call SVs between two homologous chromosomes of a diploid whole-genome assembly. What I noticed in the output was that in case of some deletions, the position of the deletion in the query sequence spans several hundred to several thousand basepairs. An example:

ref_chr query_chr       ref_start       ref_stop        size    SV_type query_start     query_stop
chr1    chr_s1  6025961 6101923 75962   deletion        993756  1156942
chr1    chr_s1  6102123 6102195 72      deletion        1157141 1181170
chr1    chr_s1  6103092 6104354 1262    deletion        1182068 1290816
chr1    chr_s1  6104835 6153341 48506   deletion        1291294 1371023
chr1    chr_s1  6153863 6154203 340     deletion        1371539 1371914
chr1    chr_s1  6155165 6162825 7660    deletion        1372885 1422600
chr1    chr_s1  6163903 6244138 80235   deletion        1423676 1538981
chr1    chr_s1  6244596 6244818 222     deletion        1539439 1539661
chr1    chr_s1  6245401 6307265 61864   deletion        1540244 1628838
chr1    chr_s1  6825101 6864750 39649   deletion        1739211 1739889
chr1    chr_s1  6865801 6934365 68564   deletion        1740286 1751443
chr1    chr_s1  8978749 8981110 2361    deletion        3301351 3301351

In the first case, there is a region of 75 kb on chr1, which corresponds to a region of 163186 bp in chr_s1. Other examples with the same situation are listed as well. However, in the last row, there is an example where the deletion of 2361 bp corresponds to 1 base in the query.

I also observed this in your yeast.tidy dataset, although to a much lesser extent. Do you have any possible explanation for this? Ideally, the deletion would of course span zero bases in the query. When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?

Also, just to be sure, are the coordinates in your TSV files zero-based?

Thank you and best regards, Linus

SAMtoBAM commented 4 years ago

Hi Linus, Sorry for the late reply,

I have not used diploid or phased assemblies as MUM&Co essentially assumes a 1-1 pairing So unfortunately I have not come across this before to a great extent

The difference between the beginning and end of alignments in the query is generally MUMmer alignment based, i.e. sometimes alignment halts prior to the actual breaksite. However in the deletion test set I do not find this problem, the greatest difference i have for the query coordinates is 5bp (since the genomes are identical apart from the deletions) Could you send me your output for the test dataset?

Are you using MUMmer4?

Coordinates are 1-based. Each value corresponding to a base-pair.

Thanks Samuel

linusdsv commented 4 years ago

Hi Samuel,

Thank you very much for your response.

I am doing the alignment between two chromosomes, with one being the paternal chromosome and the other the maternal, which I guess does not differ much (in concept) from the data that have been processed previously with Mum&Co, right?

What is curious to me is that when I call the SVs in Assemblytics (standard settings), which you also used for your benchmark tests, I did not get these extremely long query gap sizes for deletions, which indicates to me that this trend seems to be rooted in the way Mum&Co is processing the MUMmer output. I created some dotplots aligning the regions where these deletions are located and their flanking regions on the two chromosomes. For the deletions with very long query gap size, I could not see any alignments of the flanking regions, while for the deletions with small query gap sizes, I could observe the alignments.

Yes, I am using MUMmer4.

I added the test dataset output below. As mentioned, my results agree with your observation, the greatest difference for query coordinates is 5 bp as well.

DEL100_test.SVs_all.zip

Best regards, Linus

SAMtoBAM commented 4 years ago

Hi again,

Understood, I thought you meant aligning both homologous chromosomes at the same time against a reference. You are right, this is as most generic situations

First for the test data, you said you saw the same problems "I also observed this in your yeast.tidy dataset, although to a much lesser extent" Did you meant the differences between 1-5 bp? as otherwise I see no major differences

For the large differences in your data, perhaps the clue is here: "When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?" By this you mean that when you do whole genome alignments you find these large gaps in the query positions but not when you only align contig by contig to a whole genome? The calls for insertions and deletions are meant to be only within a single contig, but perhaps overlapping contigs are having an impact on the way the coordinates are determined...or something else... Could you send me your alignment files for both the whole genome and a contig that you used in your test?

Thanks Linus Samuel

linusdsv commented 4 years ago

Hi Samuel,

Yes, sorry for the confusion, I did not mean any other pattern than the deletions that sometimes have a query gap size of up to 5. So the data produced for the test set should be the same as yours.

Yes, the query gap sizes seem to be smaller when I extract the longest contig from one chromosome and align it to the other chromosome, however this of course may also happen as a side effect from selecting just a subset of the data, as there are still deletion gap sizes of >100 bp.

Below are the TSV files for both cases: Mum&Co_Output.zip

If you want to look at delta files, here are the ones for the single contigs (the others are too large to upload). Delta_Files_One_Contig.zip

If you need more files, feel free to ask for them.

Thanks a lot for the help, Linus

SAMtoBAM commented 4 years ago

Hi Linus, Sorry for the late reply,

I meant can you send me the entire alignment file produced for a whole genome alignment and a single contig alignment Where there are large gaps in the contig coordinates when doing the whole genome alignment and where this large gap disappears when aligning the single contig Perhaps this happens in the longest?

Thanks

SAMtoBAM commented 3 years ago

Is this still a problem?