how to get genomic coordinates for both the query genome and ref genome?

baoxingsong / dCNS

conserved non-coding sequence

MIT License

14 stars 4 forks source link

how to get genomic coordinates for both the query genome and ref genome? #16

Closed xuyinsheng closed 2 years ago

xuyinsheng commented 2 years ago

Hello,

I just followed this pipline to repeat the same analysis. I found I could only get the ref genome coordinates and lost the query genome information. So, how do you solve this problem? Could you please share the scripts with me?

And, waht's the meaning of this sencetence, " please NOTE. Each line in the bed file could NOT be interrupted as a CNS". Does the meaning same as "bedtools subtract -A".

Finally, the supplemental material you provided in your article only contained the coordinates for maize genome. So could you please share me with another CNS files contain both ref genome coordinates and query genome coordinates (just maize-sorghum is enrough)?

Hope to get your reply. Thank you very much!

baoxingsong commented 2 years ago

Thanks for your interest in our work.

The query genome coordinates are represented as hard-clip in the SAM files.

I parsed the SAM files for all the analyses across the manuscript. The bed files are requested by reviewers.

You could find those SAM files at https://figshare.com/articles/dataset/CNS/14129006

I use the HTSJDK library to parse those SAM files. https://samtools.github.io/htsjdk/ Here is an example for the code I used (sorry, it is not well documented): https://github.com/baoxingsong/CNSpublication/blob/master/CNS_analysis/MyCNSanalysis/src/me/songbx/mains/CNSProfile.java

xuyinsheng commented 2 years ago

Thank you very much to answer my question and give me suggestions how to analysis the sam files.

I just downloaded the sam file for sorghum. I notiiced a problem. For example, just 32th line in sorghum.sam.

10 0 10 119969835 56 4341710H10M2D2M4D1M1D10M4I8M1I2M263D66I5M1D7M1D6M1I10M 0 0 ATATGAAATAAAGCAAAGGGAAAACACAAGTAACTGACnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnCTGCAAACAAGAAAGAAATGTTTTCAGAA

In my understanding, Ns represent high frequency k-mer or CDS sequences. So, for this example, there are 66I (the number of N is 66 in query genome). And I also noticed there are 263D just in fornt of 66I (263D66I). I guessed that might be 263 n in the ref genome (maize). So these two situations could over-estimate the CNS lengths both in query genome (66) and ref genome (263).

So, how to deal with this problem to get the exact correspondence without "N". Did you have any suggestions? Hope to get your reply. Thank you very much!

baoxingsong commented 2 years ago

Yes, those long indels result from masking.

I counted the total number of 'M's in the cigar strings as the total length of CNS. Also, please note that some CNSs overlap with each other on the reference genome, and you need to unique those reference genome matches. "Aligning the genome sequence of sorghum against that of maize generated the smallest CNS space (67.07 Mbp, by counting matched base pairs in the maize genome)."

I focused on the reference genome and did not perform much analysis on the query genome.

xuyinsheng commented 2 years ago

OK , now I know more about your tool and your research. Sincerely thank you for your answer. Thank you very much!