baoxingsong / dCNS

conserved non-coding sequence
MIT License
14 stars 4 forks source link

how to get genomic coordinates for both the query genome and ref genome 2.0? #17

Closed xuyinsheng closed 2 years ago

xuyinsheng commented 3 years ago

Sorry to disturb you again. Now I have a indirect idea to get the CNSs method.

save cigar and do not merge data. bamToBed -i 5.bam -cigar| bedtools sort -i > 5.bed bedtools subtract -a 5.bed -b ns_megered.bed > 5_nons.bed

I could get a bed file (5_nons.bed) like this. (Using my own genome as a test)

image

There are 5 lines, we could notice the second line and the third lines belong to one CNS, the 4th and 5th lines belong to one CNS. They were cut because of "Ns".

I will use the query_chr information (column 4) and cigar start coordinate (just the number before H) as input to asign each to the query gene, then save it like this " ref_chr ref_start ref_end query_gene"

Finally, I will get "ref_gene ref_gene_CNS_start ref_gene_CNS_end query_gene". And using bedtools to extract sequences from ref_genome. And using minimap to align the fasta to query genome to get the coordinates for query genome.

Does it can come true and meaningful?

baoxingsong commented 3 years ago

I am not sure what is your aim. By reformating sam files into bed files, you will lose some information. But depending on your aim, that may do not matter.

"using minimap to align the fasta to query genome to get the coordinates for query genome." is not a good idea for me. In our manuscript, we highlighted the sensitivity of our approach and explained why we do not like those k-mer trigger approaches for this type of analysis.

The query coordinates and sequences are in the SAM files. In my point of view, you may like to spend more time on those SAM files. BTW, IGV is a good tool to view genome and SAM/BAM files.

xuyinsheng commented 3 years ago

Thank you very much. I get it now. I have already performed dCNS for my species. Thanks!!!