Problem with GTF file when running ribo-former

JC-therea commented 1 year ago

Good morning,

I'm currently facing an issue with your program. I've managed to install it successfully, but when using my own data, I'm encountering some problems. The error message I'm encountering is as follows:

$ riboformer template.yaml
Loading assembly data...
INFO:root:Extracted GTF attributes: ['gene_id', 'ID', 'Name', 'transcript_id', 'Parent', 'original_biotype']
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ribo-former/bin/riboformer", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/riboformer.py", line 46, in main
    process_data(args)
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/data.py", line 63, in process_data
    f = parse_transcriptome(f, args.gtf_path, args.fa_path)
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/data.py", line 83, in parse_transcriptome
    gtf = gtf.with_columns(pl.col("exon_number").cast(pl.Int32, strict=False))
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 6602, in with_columns
    self.lazy()
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1438, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ColumnNotFoundError: exon_number

Error originated just after this operation:
DF ["seqname", "source", "feature", "start"]; PROJECT */14 COLUMNS; SELECTION: "None"

I work with yeast, and therefore, the GTF file I'm using follows the same format as the human GTF, but the attribute column is different; it contains less information in my case. Is the 'exon_number' field taken into account during the execution of ribo-former? If this parameter is indeed required, do you think there might be a way to parse the GTF file to correctly add the field 'exon_number' ?

Thank you very much for your assistance.

JC-therea commented 1 year ago

If you have any toy data to test your program in my workstation it would be great too.

jdcla commented 1 year ago

The exon_number field is currently required as the code obtains the the transcript sequences from the fasta file and expects introns to be present. Could you send me a link to the assembly files that you are using, I might take a look to see how they are formatted. I think I might be able to fix this on my side by simply allowing no exon number to be present if only one entry of exons (feature = exon) is found for a given transcript ID. Thanks for your feedback.

JC-therea commented 1 year ago

Perhaps more fields are missing because I don't know which fields are needed in the attribute column of the gtf... It would be great if you could provide an example of the GTF format you are utilizing to maybe fix it by my side.

Nonetheless, here is included the link to the files I am currently using: https://drive.google.com/drive/folders/1N5FgqSZDGSURKGUj0YojXXiVAiUUvFTD?usp=drive_link

jdcla commented 1 year ago

The whole study is build upon the Ensembl human genome, which definitely works. However, I have allowed flexibility for several fields, so not all fields provided are necessary. Specifically, all information that would be necessary to parse sequence information from the fasta file is required, such as strand, start, end, exon_number. Additionally, it is necessary to have a transcript_id.

edit: start_codon is also a required field.

jdcla commented 1 year ago

I've included some toy data in the RIBO_former repository. Please also update transcript_transformer, as I have fixed a regression bug.

I looked at your gtf and there are definitely multiple transcripts with multiple exons. I believe there is a way forward to parse GTF files without using the exon number feature, as exons can be sorted is either through using the ID field or by simply looking at the coordinates and strand information. The latter seems more robust.

However, I'm somewhat hesitant to do this at this point. Your gtf file furthermore has the problem of not containing any start codon fields, which is the feature used to obtain the posive set (Translation initiation sites). I'm not sure how common it is to have assemblies without the exon_number and start_codon feature. For the human genome, not every coding sequence has a start codon feature. If I remember correctly, this is because not every transcript isoform is valid (wrong splicing etc.)

I assume these gtf files have been generated through some program locally? Could you share how you obtained these?

jdcla commented 1 year ago

There is another problem, namely that the script currently does not support non-human transcriptome fine-tuning. More precise, it does not search for seqnames that are not part of the human transcriptome.

There is no inherent blocker to use the pre-trained model for non-human data, but I have currently implemented the code in a way that it only looks at and splits the data according to the seqnames the models were pre-trained on (https://github.com/jdcla/transcript_transformer/blob/main/transcript_transformer/pretrained/riboformer_models/50perc_06_23.yml).

I added this feature to the roadmap and hope to implement this soon. See the README.md to track this.

JC-therea commented 1 year ago

Hello! Thank you for continuing to assist with the issue I've encountered. The GTF file I'm using is an adaptation of a GFF3 file obtained from the most up-to-date S. pombe genome page at https://www.pombase.org/data/genome_sequence_and_features/gff3/. As you can see, it doesn't contain much of the information you mentioned. It's a limitation due to the data types I'm working with, which are inherently less comprehensive than human annotations.

I generated the GTF using AGAT (https://github.com/NBISweden/AGAT/tree/master), which is quite robust for making these types of modifications.

For now, I'll attempt to use the toy data to gain a more detailed understanding of how the program operates.

Once again, thank you very much for the effort you've put into helping me with this issue.

jdcla commented 1 year ago

You're welcome. I appreciate the feedback. I will keep this issue open as it is relevant to address the exon_number problem.

jdcla commented 1 month ago

This has now been fixed. Additionally, it is now possible to pre-train new models.

TRISTAN-ORF / RiboTIE

Problem with GTF file when running ribo-former #2