GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

tama_format_id_filter.py : IndexError: list index out of range #101

Closed olechnwin closed 1 year ago

olechnwin commented 1 year ago

Hi Richard, I was trying to re-arrange the ID field of my bed file, which was generated by merging Liftoff and Iso-Seq annotations:

python ~/opt/tama/tama_merge.py -f filelist.txt -d merge_dup -p merged_annos_a673_2 -s gencode
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b merged_annos_a673_2.bed \
        -o merged_annos_a673_2_filt.bed \
        -s custom -r 3,4,1,2 -d ";"

This is the error I got:

python /opt/tama/tama_go/format_converter/tama_format_id_filter.py -b merged_annos_a673_2.bed -o merged_annos_a673_2_filt.bed -s custom -r 3,4,1,2 -d ';'
opening bed file
Traceback (most recent call last):
  File "/opt/tama/tama_go/format_converter/tama_format_id_filter.py", line 272, in <module>
    new_output_line,output_flag = id_parser(id_line)
  File "/opt/tama/tama_go/format_converter/tama_format_id_filter.py", line 246, in id_parser
    new_output_list.append(id_split[reshufle_index])
IndexError: list index out of range

Here are some example lines from merged_annos_a673_2.bed:

scaffold_1      96339   96600   G5;G5.2;ENSG00000226722.4;ENST00000653691.1;ENSG00000226722.4;ENST00000663265.1 40      -       96339   96600   255,0,0 1       261     0
scaffold_1      97708   222786  G6;G6.1 40      +       97708   222786  200,255,0       3       192,280,2103    0,2369,122975
scaffold_1      97709   101811  G6;G6.2 40      +       97709   101811  255,200,0       4       191,74,120,554  0,2368,2528,3548

How do I fix this error?

Thank you for your help! Cen

GenomeRIK commented 1 year ago

Hi Cen,

Use the default "-s ensembl_merge". The custom field re-arrangement can only be used if all lines have the same number of ID fields, which is not the case in your file.
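As an illustration of the point above, here is a minimal Python sketch for checking a bed file before trying "-s custom". It assumes the ID field is the fourth bed column and that "-r 3,4,1,2" refers to 1-based subfield positions; the helper names `subfield_counts` and `can_reorder` are hypothetical and not part of TAMA.

```python
def subfield_counts(bed_lines, delimiter=";"):
    """Map each ID-subfield count to the number of bed lines that have it."""
    counts = {}
    for line in bed_lines:
        fields = line.split()  # bed columns are whitespace separated
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        n = len(fields[3].split(delimiter))
        counts[n] = counts.get(n, 0) + 1
    return counts


def can_reorder(id_field, order, delimiter=";"):
    """True if every requested 1-based position exists in this ID field (assumed semantics of -r)."""
    return max(order) <= len(id_field.split(delimiter))


# Two of the example lines from this thread.
example = [
    "scaffold_1 96339 96600 G5;G5.2;ENSG00000226722.4;ENST00000653691.1;"
    "ENSG00000226722.4;ENST00000663265.1 40 - 96339 96600 255,0,0 1 261 0",
    "scaffold_1 97708 222786 G6;G6.1 40 + 97708 222786 200,255,0 3 "
    "192,280,2103 0,2369,122975",
]

counts = subfield_counts(example)
print(counts)  # more than one key means the lines differ in subfield count
print(can_reorder("G6;G6.1", [3, 4, 1, 2]))
```

If `subfield_counts` reports more than one distinct count, a fixed re-arrangement order cannot apply to every line, which matches the IndexError reported above.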

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Thank you so much for your help. As always, I truly appreciate you taking the time to reply.

Best, Cen

olechnwin commented 1 year ago

Hi Richard,

I'm very sorry. I meant to post it here. My bad. I'm going to delete that other post. To keep it in the same thread, here is the image where the novel transcript disappears after filtering. As shown below, the first two tracks are the merged annotations, and the third track, which is after filtering, is missing G50504. Is there a way to keep G50504?

image

I was running the exact same command with "-s ensembl_merge":

python ~/opt/tama/tama_go/format_converter/tama_format_gff_to_bed12_liftoff.py ${gff_dir}/${gff_name} ${gff_name/.gff3/.bed}
python ~/opt/tama/tama_merge.py -f filelist.txt -d merge_dup -p merged_annos_a673_2 -s gencode
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b merged_annos_a673_2.bed \
        -o merged_annos_a673_2_filt.bed
python ~/opt/tama/tama_go/format_converter/tama_convert_bed_gtf_ensembl_no_cds.py \
        merged_annos_a673_2_filt.bed merged_annos_a673_2_filt.gtf

GenomeRIK commented 1 year ago

Hi Cen,

Could you load all the bed and GTF files produced along the way to see where the novel models drop out?
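Besides loading the tracks in a genome browser, the bed stages can also be compared with a script. The following is a minimal Python sketch, not a TAMA tool: it assumes standard 12-column bed lines and identifies each model by its coordinates, strand, and exon blocks, so that ID re-formatting between steps does not matter. The helper names are hypothetical, and GTF files would need a separate parser that is not shown here.

```python
def model_key(fields):
    """Identify a model by chrom, start, end, strand, block sizes, and block starts."""
    return (
        fields[0], fields[1], fields[2], fields[5],
        fields[10].rstrip(","), fields[11].rstrip(","),  # tolerate trailing commas
    )


def index_models(bed_lines):
    """Map each model key to the list of ID fields that share it."""
    index = {}
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # not a full bed12 line
        index.setdefault(model_key(fields), []).append(fields[3])
    return index


def dropped_models(before_lines, after_lines):
    """Return the ID fields of models present before a step but absent after it."""
    before = index_models(before_lines)
    after = index_models(after_lines)
    return sorted(
        name
        for key, names in before.items()
        if key not in after
        for name in names
    )


# Example lines from this thread; 'after' simulates a step that lost one model.
before = [
    "scaffold_1 97708 222786 G6;G6.1 40 + 97708 222786 200,255,0 3 "
    "192,280,2103 0,2369,122975",
    "scaffold_1 97709 101811 G6;G6.2 40 + 97709 101811 255,200,0 4 "
    "191,74,120,554 0,2368,2528,3548",
]
after = before[:1]

dropped = dropped_models(before, after)
print(dropped)
```

Running this on each consecutive pair of bed files narrows down the step at which a model disappears. If every bed stage still contains the model, the loss must happen in a later step such as the GTF conversion.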

Also, the next time you show the genome browser view, can you make sure all tracks are in expanded mode?

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Here are all the bed and GTF files. It turns out the novel transcript was dropped during the conversion to GTF. Also, do you happen to know why the transcripts are in different colors?

image

Thank you so much! Cen

Edit: added the filelist and the processing steps.

GenomeRIK commented 1 year ago

Hi Cen,

Sorry, but could you annotate the image to indicate which track shows which processing step?

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard, I have updated the figure above with the processing steps. Thank you, Cen

GenomeRIK commented 1 year ago

Hi Cen,

OK, I see the problem now. I have fixed the bug here for tama_format_id_filter.py. Could you update and try the new version?

The problem is that TAMA was not adding the TAMA IDs to the first 2 ID subfields, so the model was not being recognized by the GTF converter.
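A quick way to check a filtered bed file for this kind of problem is sketched below in Python. This is an assumption-laden illustration: it only tests that the first two ID subfields exist and are non-empty, on the assumption that the GTF converter reads its gene and transcript IDs from them. What the affected lines actually looked like before the fix is not shown in this thread, so the malformed line below is hypothetical, and `lines_missing_leading_ids` is not a TAMA function.

```python
def lines_missing_leading_ids(bed_lines, delimiter=";"):
    """Return ID fields whose first two subfields are missing or empty."""
    flagged = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        parts = fields[3].split(delimiter)
        has_both = len(parts) >= 2 and parts[0] != "" and parts[1] != ""
        if not has_both:
            flagged.append(fields[3])
    return flagged


lines = [
    # Example line from this thread: both leading subfields are present.
    "scaffold_1 97708 222786 G6;G6.1 40 + 97708 222786 200,255,0 3 "
    "192,280,2103 0,2369,122975",
    # Hypothetical malformed line (not from the thread): leading subfields empty.
    "scaffold_1 97709 101811 ;;G6;G6.2 40 + 97709 101811 255,200,0 4 "
    "191,74,120,554 0,2368,2528,3548",
]

flagged = lines_missing_leading_ids(lines)
print(flagged)
```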

As for the different coloured transcript models, that is a feature of TAMA Merge: it uses the colour field of the bed file to show each model's source of origin. You can read about this on the TAMA Merge page of the wiki.
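To see which colours occur in a merged bed file, the ninth bed column (itemRgb in the standard bed12 layout) can be tallied. The Python sketch below does only that; which colour corresponds to which source is described in the wiki and is not reproduced here, and `colour_counts` is a hypothetical helper name.

```python
from collections import Counter


def colour_counts(bed_lines):
    """Count bed lines per itemRgb value (ninth column)."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 9:
            counts[fields[8]] += 1
    return counts


# The three example lines from this thread.
example = [
    "scaffold_1 96339 96600 G5;G5.2;ENSG00000226722.4;ENST00000653691.1;"
    "ENSG00000226722.4;ENST00000663265.1 40 - 96339 96600 255,0,0 1 261 0",
    "scaffold_1 97708 222786 G6;G6.1 40 + 97708 222786 200,255,0 3 "
    "192,280,2103 0,2369,122975",
    "scaffold_1 97709 101811 G6;G6.2 40 + 97709 101811 255,200,0 4 "
    "191,74,120,554 0,2368,2528,3548",
]

colours = colour_counts(example)
for rgb, n in sorted(colours.items()):
    print(rgb, n)
```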

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Thank you so much for quickly fixing the problem. I will try the new version. It'll take me a while to try it though, as our HPC cluster has been swamped lately.

Thank you, Cen

olechnwin commented 1 year ago

Hi Richard,

I have tried the new version. It works! The missing novel transcript is now in the GTF file.

Thank you so much! Cen

GenomeRIK commented 1 year ago

Hi Cen,

Glad it's working for you now!

Thanks for using TAMA! Richard