gaoyubang / nanom6A

MIT License
18 stars 11 forks source link

Display NA in gene name in genome_abandance.0.x.bed #12

Closed YCCHEN23 closed 2 years ago

YCCHEN23 commented 2 years ago

Hi, I had perform nanom6A analysis, and got 2490723 potentially-modified positions.

Most of the sites were recorded properly, however, I found some position display NA in gene name column (129 sites in my data),

Example: na

But in the same position, another reads display an correct gene name.

at1g24996

And there also some "NA" appeared in the regions out of transcripts.

I'm just wondering how these "NA" happened?

I pretty sure that my genome & transcriptome & gene annotation reference are all in the same version.

Thanks!

gaoyubang commented 2 years ago

image In the 245 line of predict_sites.py, if the reads can't map to the gene using bedtools ,then the gene name will be named with "NA"。 You can simply check the reads "343b2dba - 1af4- 4cde . a368.4b7cb4e0e86e . fast5" not in extract.bed6.gene and "562d3425.17b5- 4306 - 9a3d- ee1685f5b321. fast5" in extract.bed6.gene, look forward to your reply。

YCCHEN23 commented 2 years ago

Yes, the reads "343b2dba-1af4-4cde-a368-4b7cb4e0e86e.fast5" did not belong to any gene in the extract.bed6.gene file, and this read was aligned to the chr1 8798516 - 8798886 (no annotation within this region).

And I found that the m6A site at ch1.8798533 was not located in AT1G24996 (Chr1.8799520 - 8800410) too, but this m6A site was belong to AT1G24996. (Because "562d3425-17b5-4306-9a3d-ee1685f5b321.fast5" was aligned to the chr1 8798469 - 8800572 and it overlapped with AT1G24996)

So, if any part of the fast5 was overlapping with the gene.bed, the m6A sites on this read were all belong to the overlapped-gene, no matter how far it was. Am I right?

And I had another question:

Take AT1G01040.1 for example:

example

The fast5 reads of AT1G01040.1 were overlapping with AT1G01040, AT1G03993, AT1G01046, and AT1G01050 at the same time, so, all the m6A sites on AT1G01040.1 transcripts were assigned to all the four genes in my case.

I'm just simply removing all the m6A sites which out of its annotated regions in my results, but I can't deal with the following regions:

example2

Could I avoid these situation during predicte_sites.py? or I need to filter these sites out by myself at the present stage?

Thanks for your reply!

gaoyubang commented 2 years ago

I have update a version to solve the bug by map the reads to reference transcriptome.

YCCHEN23 commented 2 years ago

Glad to hear the new update!

I'll re-analysis my data with the latest version,

I'll close the issue within few days if there has no further question or error after running.

Many thanks!

YCCHEN23 commented 2 years ago

Hi @gaoyubang

Thanks for the update version, I had performed the latest nanom6A, the problem of the overlapping genes was solved, and non of the m6A site was assigned to the wrong gene.

However, some problems still happened in my case: few reads were "mapped to the right transcript" but obtained "wrong genome coordinate" in the result.

Take one of my fast5 read as example, "ba16bcba-6337-486a-9a8e-f7480cb2e7a5.fast5":

The read was perfectly mapped to AT1G25175.1 (I checked the extract.reference.sort.bam and also BLAST the fast5 to NCBI).

bamex fast5_blast

According to the information in gtf file, the location of the full length transcript is ranged in chr1:8826983-8829682. However, in the "extract.sort.bam" file, the genome coordinate of this read was ranged in chr1:8775947-8777576, which largely shifted from the right position.

Then I check if all the reads mapped to this transcript were all shifted from the right place, but it didn't, still many reads which mapped to AT1G25175.1 were assigned to the right genome coordinate.

Willing to provide more details if needed, thanks!

Best regards YCCHEN

gaoyubang commented 2 years ago

This phenomenon is caused by the align algorithm of minimap2. A simple verification is that you align the reference gene to the reference genome, and you will find that 99% of the aligned positions are consistent with GFF, but 1% is different. This error cannot be solved temporarily. You can filter it by yourself if necessary.

YCCHEN23 commented 2 years ago

Okay, I got it, many thanks for all the help~