Open linusdsv opened 4 years ago
Hi Linus, Sorry for the late reply,
I have not used diploid or phased assemblies as MUM&Co essentially assumes a 1-1 pairing So unfortunately I have not come across this before to a great extent
The difference between the beginning and end of alignments in the query is generally MUMmer alignment based, i.e. sometimes alignment halts prior to the actual breaksite. However in the deletion test set I do not find this problem, the greatest difference i have for the query coordinates is 5bp (since the genomes are identical apart from the deletions) Could you send me your output for the test dataset?
Are you using MUMmer4?
Coordinates are 1-based. Each value corresponding to a base-pair.
Thanks Samuel
Hi Samuel,
Thank you very much for your response.
I am doing the alignment between two chromosomes, with one being the paternal chromosome and the other the maternal, which I guess does not differ much (in concept) from the data that have been processed previously with Mum&Co, right?
What is curious to me is that when I call the SVs in Assemblytics (standard settings), which you also used for your benchmark tests, I did not get these extremely long query gap sizes for deletions, which indicates to me that this trend seems to be rooted in the way Mum&Co is processing the MUMmer output. I created some dotplots aligning the regions where these deletions are located and their flanking regions on the two chromosomes. For the deletions with very long query gap size, I could not see any alignments of the flanking regions, while for the deletions with small query gap sizes, I could observe the alignments.
Yes, I am using MUMmer4.
I added the test dataset output below. As mentioned, my results agree with your observation, the greatest difference for query coordinates is 5 bp as well.
Best regards, Linus
Hi again,
Understood, I thought you meant aligning both homologous chromosomes at the same time against a reference. You are right, this is as most generic situations
First for the test data, you said you saw the same problems "I also observed this in your yeast.tidy dataset, although to a much lesser extent" Did you meant the differences between 1-5 bp? as otherwise I see no major differences
For the large differences in your data, perhaps the clue is here: "When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?" By this you mean that when you do whole genome alignments you find these large gaps in the query positions but not when you only align contig by contig to a whole genome? The calls for insertions and deletions are meant to be only within a single contig, but perhaps overlapping contigs are having an impact on the way the coordinates are determined...or something else... Could you send me your alignment files for both the whole genome and a contig that you used in your test?
Thanks Linus Samuel
Hi Samuel,
Yes, sorry for the confusion, I did not mean any other pattern than the deletions that sometimes have a query gap size of up to 5. So the data produced for the test set should be the same as yours.
Yes, the query gap sizes seem to be smaller when I extract the longest contig from one chromosome and align it to the other chromosome, however this of course may also happen as a side effect from selecting just a subset of the data, as there are still deletion gap sizes of >100 bp.
Below are the TSV files for both cases: Mum&Co_Output.zip
If you want to look at delta files, here are the ones for the single contigs (the others are too large to upload). Delta_Files_One_Contig.zip
If you need more files, feel free to ask for them.
Thanks a lot for the help, Linus
Hi Linus, Sorry for the late reply,
I meant can you send me the entire alignment file produced for a whole genome alignment and a single contig alignment Where there are large gaps in the contig coordinates when doing the whole genome alignment and where this large gap disappears when aligning the single contig Perhaps this happens in the longest?
Thanks
Is this still a problem?
Hello,
I have been using Mum&Co to call SVs between two homologous chromosomes of a diploid whole-genome assembly. What I noticed in the output was that in case of some deletions, the position of the deletion in the query sequence spans several hundred to several thousand basepairs. An example:
In the first case, there is a region of 75 kb on chr1, which corresponds to a region of 163186 bp in chr_s1. Other examples with the same situation are listed as well. However, in the last row, there is an example where the deletion of 2361 bp corresponds to 1 base in the query.
I also observed this in your yeast.tidy dataset, although to a much lesser extent. Do you have any possible explanation for this? Ideally, the deletion would of course span zero bases in the query. When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?
Also, just to be sure, are the coordinates in your TSV files zero-based?
Thank you and best regards, Linus