comprna / reorientexpress

Transcriptome long-read orientation with Deep Learning
MIT License

Error when running the training mode with annotation #2

Closed: akramdi closed this issue 5 years ago

akramdi commented 5 years ago

Hi,

I am running reorientexpress for the first time and I am having trouble training the model with my annotation file. Here are my command and the error:

Command:

python $SOURCE/reorientexpress/reorientexpress.py -train \
-data transcripts.fa  \
-format fasta \
-source annotation \
--v -output annotation_model

Error:

Using TensorFlow backend.

----Starting Training Pipeline----

Traceback (most recent call last):
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 743, in <module>
    options.oh)
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 544, in build_kmer_model
    sequences = read_annotation_data(path = path_data, trimming = trimming, n_reads = n_reads, use_all_annotation = use_all_annotation)
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 341, in read_annotation_data
    read_type = sline[-1].split(':')[1]
IndexError: list index out of range

transcripts.fa is a FASTA file I generated from a GTF file using gffread. Here's what it looks like:

>AT1G01010.1 gene=NAC001 CDS=130-1417
AAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAG
AGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGATCA
....

Is it due to the FASTA format? Is there a specific format required for the header?

Thanks a lot for the help!

Best, Amira

angelrure commented 5 years ago

Hi,

The issue was related to the header: we had been working with files whose headers used a different format.

Now it should work with any header.

Let us know if it now works as expected for you.

Sorry for the inconvenience!

Ángel
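For readers who hit the same IndexError: the first traceback fails on `sline[-1].split(':')[1]`, which assumes the last header field contains a colon. A gffread header such as `>AT1G01010.1 gene=NAC001 CDS=130-1417` ends in `CDS=130-1417`, so the split yields a single item and index 1 does not exist. Below is a minimal sketch of a more tolerant parser. It is not the code from the repository; the function name `parse_read_type`, the `"unknown"` default, and the second example header are assumptions made for illustration.

```python
def parse_read_type(header, default="unknown"):
    """Return the text after the first ':' in the last header field.

    Falls back to `default` when the header has no fields or the last
    field has no ':' (as with gffread output), instead of raising
    IndexError.
    """
    # Drop the leading '>' and split the header on whitespace.
    fields = header.lstrip(">").split()
    if not fields:
        return default
    # Mirror the original expression, but check the length first.
    parts = fields[-1].split(":")
    if len(parts) > 1 and parts[1]:
        return parts[1]
    return default


# Header from this issue: no ':' in the last field, so the default is used.
print(parse_read_type(">AT1G01010.1 gene=NAC001 CDS=130-1417"))
# Hypothetical header whose last field has the 'key:value' shape.
print(parse_read_type(">tx1 cdna transcript_biotype:protein_coding"))
```

Returning a default keeps every record usable regardless of header layout, which matches the stated goal of working "with any header", though the actual fix may have been implemented differently.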

akramdi commented 5 years ago

Hi Angel,

Thanks for the fix! After updating the scripts and running the same command, I now get a different error message:

Preparing the data
Assuming the data provided is all in forward
Traceback (most recent call last):
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 755, in <module>
    options.oh)
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 561, in build_kmer_model
    data, labels = prepare_data(sequences, order, full_counting, ks, False, path_paf, only_last_kmer=only_last_kmer, reverse_all = reverse_all, one_hot = one_hot)
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 177, in prepare_data
    sequences_reverse = sequences_reverse.apply(reverse_complement)
  File "/import/bc_users/a2e/kramdi/vienv/env3/lib/python3.5/site-packages/pandas/core/series.py", line 3591, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 75, in reverse_complement
    return ''.join([complement[base] for base in dna[::-1]])
  File "/kingdoms/a2e/programs/Tools/reorientexpress/reorientexpress.py", line 75, in <listcomp>
    return ''.join([complement[base] for base in dna[::-1]])
KeyError: 'Y'

I could send you the fasta file I use if it helps.

Also, I had to comment out a line of code that prints the transcript sequences one by one (line 331 in reorientexpress.py). I am guessing it was there for debugging purposes only.

Thanks! Amira

angelrure commented 5 years ago

Hi,

Looking at the error, I think your FASTA file contains nucleotide symbols the program wasn't expecting (namely a Y).

I've made a quick fix to discard sequences with non-standard nucleotide symbols. Please pull/clone the repository again and let me know if it works now.

If it doesn't, please provide your input so I can check what the problem is.

Thanks,

Ángel
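For context, `Y` is the IUPAC ambiguity code for a pyrimidine (C or T), so a complement table that only covers the four standard bases raises `KeyError: 'Y'` during the lookup. The sketch below shows one way to discard such sequences before reverse-complementing. It is an illustration, not the repository's actual fix: the names `STANDARD_BASES`, `has_only_standard_bases`, and `filter_standard` are hypothetical, and treating lowercase input as valid while rejecting `N` is an assumption.

```python
STANDARD_BASES = frozenset("ACGT")
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def has_only_standard_bases(seq):
    """True if every symbol in seq is A, C, G, or T (case-insensitive)."""
    return set(seq.upper()) <= STANDARD_BASES


def reverse_complement(seq):
    """Reverse-complement a sequence made of standard bases only."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def filter_standard(sequences):
    """Return (kept sequences, number discarded) for a list of sequences."""
    kept = [s for s in sequences if has_only_standard_bases(s)]
    return kept, len(sequences) - len(kept)


# Toy input: the second sequence contains the ambiguity code Y.
kept, dropped = filter_standard(["AACG", "ACYT", "ggca"])
print(kept, dropped)
print([reverse_complement(s) for s in kept])
```

Reporting the number of discarded sequences makes it easy to notice when a file loses a large share of its records to filtering.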

akramdi commented 5 years ago

It works now, thanks!

Best, Amira