TRISTAN-ORF / RiboTIE

Scripts and instructions to apply RiboTIE on Ribo-seq data
MIT License

Cryptic description of input/output requirements in yaml file #4

Closed ewaldvandyk closed 7 months ago

ewaldvandyk commented 9 months ago

Good day.

Thank you for this tool. I have a few questions.

Can you please clarify the input/output requirements for RIBO-former a little better? For example, I'm not sure if the fasta files should be on the transcriptome or genome level.

Also, I'm not sure whether the .h5 file should be an input or an output (to be created) file. For example, if I don't specify the .h5 file in the yaml file, riboformer complains; if I do specify it, riboformer complains that it cannot read the file. So either this is a required input file and I'm uncertain what it should contain, or riboformer failed, didn't write output to the .h5 file, and later tried to load it and failed.

Output of error messages:


ENST00000624828.1...
0it [00:00, ?it/s]
ENST00000436045.1...
0it [00:00, ?it/s]
Save data in hdf5 files...
ERROR:root:Traceback (most recent call last):
  File "/Users/e.v.dijk/.venv_py311/lib/python3.11/site-packages/transcript_transformer/data.py", line 72, in process_seq_data
    f = save_transcriptome_to_h5(f, data_dict)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/e.v.dijk/.venv_py311/lib/python3.11/site-packages/transcript_transformer/data.py", line 158, in save_transcriptome_to_h5
    key, data=array, dtype=f"<S{max(1,max([len(s) for s in array]))}"
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence

Traceback (most recent call last):
  File "/Users/e.v.dijk/.venv_py311/bin/riboformer", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/e.v.dijk/.venv_py311/lib/python3.11/site-packages/transcript_transformer/riboformer.py", line 84, in main
    process_seq_data(
  File "/Users/e.v.dijk/.venv_py311/lib/python3.11/site-packages/transcript_transformer/data.py", line 79, in process_seq_data
    shutil.copy(h5_path, backup_path)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/shutil.py", line 419, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/shutil.py", line 256, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'my_experiment.h5'
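As an aside, the first traceback reduces to Python's generic empty-sequence failure: no transcript entries survived parsing, so the list whose string width is being measured is empty. A minimal reproduction (illustrative only, not RiboTIE code):

```python
# save_transcriptome_to_h5 computes max(len(s) for s in array) to size the
# hdf5 string dtype; with no parsed entries, the sequence is empty and
# max() raises exactly the ValueError shown above.
try:
    max(len(s) for s in [])  # no parsed entries -> empty sequence
except ValueError as err:
    print(err)  # prints: max() arg is an empty sequence
```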

Thanks in advance!
jdcla commented 9 months ago

Hello,

Thank you for your feedback!

It seems I do indeed have to clarify the files. The .h5 file is a database that is created when parsing the assembly information and ribosome profiling data. RIBO-former only supports parsing of genome-level human assemblies and sequence information. Check the test/ folder for some example snippets.

Genome-level assemblies have been necessary to group transcript isoforms per chromosome during training of the models, to prevent training on multi-mapped reads that could also be present in the validation/test sets (i.e. overfitting). They furthermore provide a lot of information about exons and genome coordinates that is included in the final table. For example, genome coordinates are required to detect CDS variant ORF types.

In conclusion, I don't think running RIBO-former without genome-level information is a good idea.

I clarified the README.md to reflect your feedback. I hope this solves your problem.

NicolasProvencher commented 9 months ago

Hi, I am having the same issue. I am using the Ensembl Homo_sapiens.GRCh38.109.chr.gtf file, together with a concatenation of all the chromosome files from the Ensembl genomic FASTA releases.

I was able to trace the source of the issue above to None values in the parsed dictionary named data_dict, in the values for the keys genome_name, tag, and support level. Since the parsing from gtfparse added None values where fields were unavailable, the code crashed when it tried to measure the length of a None value to assign it the correct space in the h5 object.

Now, I am not very familiar with h5 objects, so I don't know the best way to fix this; in the meanwhile, I'm testing the code with None replaced by 'None' in the dict.
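A small self-contained sketch of the failure mode and the workaround described above (the variable names are illustrative; the real code lives in transcript_transformer/data.py):

```python
# Parsed GTF attribute values, with None where a field (e.g. "tag") was absent.
values = ["basic", None, "CCDS"]

# The crash: sizing the fixed-width h5 string dtype calls len() on every
# entry, and len(None) raises a TypeError.
# Workaround described above: substitute the literal string "None" first.
cleaned = [v if v is not None else "None" for v in values]
width = max(1, max(len(s) for s in cleaned))  # width of the longest entry
print(f"<S{width}")  # prints: <S5  (the dtype string passed to h5py)
```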

Hope this helps

Nicolas

jdcla commented 9 months ago

Hi Nicolas, could you send me some more info, e.g. the exact error trace of your Python code and potentially even the gtf/fa files? I'll look into it asap.

The error from Ewald is clearly the result of parsing transcripts rather than chromosomes (see his output). Normally, the code is flexible enough to be able to handle missing fields for tag and transcript_support_level. I am not sure which field genome_name refers to, because it does not exist in the code.

Maybe open a separate issue because I don't think the two are related.

jdcla commented 9 months ago

@ewaldvandyk Are you still experiencing issues?

ewaldvandyk commented 8 months ago

I'm not experiencing this particular issue anymore. However, I did run into problems when the Ribo-seq data was aligned to the genome instead of the transcriptome. So it seems the reference fasta file needs to be relative to the genome (i.e. relative to chromosomes), whereas the actual Ribo-seq sam/bam files need to be aligned to the transcriptome. To summarise:

  * Reference fasta file -> relative to the genome
  * gtf file -> matched to the .fa file (i.e. make sure it is relative to GRCh38)
  * ribo profiles -> bam files aligned to the transcriptome (not the genome); I've only tested .bam, not .sam files

Is this correct?
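One quick way to sanity-check which reference a BAM was aligned to is to look at the reference names in its header (e.g. from `samtools view -H file.bam`): a transcriptome-aligned file lists transcript IDs such as ENST..., while a genome-aligned one lists chromosome names. A hypothetical helper (the function name and heuristic are mine, not part of RiboTIE):

```python
def looks_transcriptome_aligned(reference_names):
    """Heuristic: True if the alignment references are Ensembl transcript
    IDs (ENST...) rather than chromosome names like '1' or 'chr1'."""
    return all(name.startswith("ENST") for name in reference_names)

# Reference names as they might appear in a BAM header's @SQ SN: fields.
print(looks_transcriptome_aligned(["ENST00000624828.1", "ENST00000436045.1"]))  # True
print(looks_transcriptome_aligned(["1", "2", "X", "MT"]))                       # False
```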

I have a few extra questions/comments:

  1. Do you have a rule of thumb for how much RAM I need? I suppose this depends on the number of samples and the read depth of each Ribo-seq sample. So far I have consistently underestimated this requirement (via slurm) and had jobs killed after many hours (I've tried up to 50GB max memory).
  2. It seems that RIBO-former package management only supports pip. I prefer to use miniconda, and when I created a virtual environment for RIBO-former, one of the first packages I installed was riboformer itself, which automatically installed its dependencies via pip (numpy, etc.); this caused serious dependency issues later on (clashes with miniconda). I eventually solved this by making a list of all riboformer dependencies, installing as many as possible with conda, and then installing riboformer with pip. Now everything seems to be working (minus the RAM requirement issue).

I greatly appreciate the help. Best regards, Ewald

jdcla commented 8 months ago

Your summary is correct. To answer your questions:

  1. RAM is actually only needed to parse the data from the .bam files. I use biobear to load these files in full. Once stored in the hdf5 format, the data is loaded per sample/batch. As such, RAM requirements mainly depend on the largest bam file being parsed. When using HPC infrastructure, it might be easier to first parse the files on a local server if available (running the script with the --data flag), and do the processing afterwards on the cluster.
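The two-step workflow described above can be sketched as follows; the CLI name and the --data flag come from this thread, but the positional config-file argument is an assumption, so check the README for the current syntax:

```shell
# Step 1 (local server, RAM-heavy): parse the .fa/.gtf/.bam inputs into the
# .h5 database referenced by the yaml config.
riboformer config.yml --data

# Step 2 (cluster, modest RAM): run against the pre-built .h5 file, which is
# read per sample/batch rather than loaded whole.
riboformer config.yml
```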

  2. I personally do not like conda, due to my experience of conda not being able to solve environments for older packages. If you simply create custom environments for different tools/processing pipelines, you avoid your problem. Essentially, create an environment just for running RIBO-former.
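One way to follow that advice, assuming the package installs via pip under the name riboformer as described earlier in this thread; a sketch, not an official recipe:

```shell
# A dedicated environment so the pip-installed dependencies (numpy, etc.)
# cannot clash with other conda-managed tools.
conda create -n riboformer python=3.11
conda activate riboformer
pip install riboformer
```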

jdcla commented 7 months ago

I'm closing this issue now as it seems the original problem is solved.