GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
132 stars 22 forks source link

readGTF split issue #85

Closed obenno closed 9 months ago

obenno commented 2 years ago

Hi,

Thanks a lot for developing xpore!

I'm working on some non-model organisms, and testing xpore with our own pipeline. The transcripts were assembled with stringtie, and xpore was installed by conda install, the package info is:

# packages in environment at /home/ubuntu/Tools/miniconda3/envs/xpore:
#
# Name                    Version                   Build  Channel
xpore                     2.0                pyh5e36f6f_0    bioconda

and we found stringtie output gtf file was not compatible with xpore. The trackback was like below:

Traceback (most recent call last):
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/bin/xpore", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 692, in dataprep
    gtf_dict = readGTF(gtf_path_or_url)
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 184, in readGTF
    tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]

By checking the readGTF function, I found this issue may due to a "non-standard" split of gtf attributes (column 9) in readGTF function: https://github.com/GoekeLab/xpore/blob/8722c06314d1fec90dde85347186612144fab6e6/xpore/scripts/dataprep.py#L184-L185

The stringtie gtf has transcript_id as the first attribute, but gene_id as the second, which is not the same as human hg38 gtf file.

example stringtie.gtf:

Gm01    StringTie   transcript  120160  131940  .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; gene_name "MSTRG.3"; xloc "XLOC_000001"; cmp_ref "Glyma.01G000600.2.Wm82.a4.v1"; class_code "="; cmp_ref_gene "Glyma.01G000600.Wm82.a4.v1"; tss_id "TSS1";
Gm01    StringTie   exon    120160  122559  .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "1";
Gm01    StringTie   exon    131469  131940  .   +   .   transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "2";

Hope this helps, thanks! obenno

yuukiiwa commented 2 years ago

Hi obenno,

Thank you for reporting this bug! The readGTF() function was written based on the ensembl gtf files, which have transcript_id after the gene_id. I will update the code accordingly and make it more flexible for other kinds of gtf files. Will update you once the code is updated!

Best wishes, Yuk Kei

obenno commented 2 years ago

Hi @yuukiiwa,

Thanks a lot for your prompt response.

If you don't mind, I could do a PR ^^

Best regards, obenno

yuukiiwa commented 2 years ago

Hi obenno,

Thank you so much for offering to submit a PR! @ploy-np will update the dev branch. Once the dev branch is updated (we will update you on that to avoid merging conflict), you can then submit a PR to the dev branch. Thank you!!

Best wishes, Yuk Kei

yuukiiwa commented 2 years ago

Hi @obenno,

The dev branch is updated! It will be great if you can submit a PR to the dev branch. Thank you!!

Best wishes, Yuk Kei

obenno commented 2 years ago

Hi @yuukiiwa,

Thank you for the updates!

The PR was submitted as https://github.com/GoekeLab/xpore/pull/87. The code was tested on both the demo data and my own data. By the way, it seems that the fasta and gtf in the your demo.tar.gz are not valid. I used GRCh38 gtf and cDNA fasta (release 91) downloaded from ensembl's FTP server.

Best regards, obenno

yuukiiwa commented 2 years ago

Hi obenno,

Thank you for the quick update! We will review and likely merge in the PR by the end of this week! The zenodo demo.tar.gz link is updated but not the link on the readthedocs yet.

Best wishes, Yuk Kei

obenno commented 2 years ago

Hi @yuukiiwa,

Thank you and please feel free to close this issue.