CGATOxford / CGATPipelines

Collection of CGAT NGS Pipelines
MIT License

local variable ensembl_no referenced before assignment #25

Closed AndreasHeger closed 9 years ago

AndreasHeger commented 9 years ago

  File "/ifs/apps/apps/python-2.7.9/lib/python2.7/site-packages/ruffus/task.py", line 743, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "/ifs/apps/apps/python-2.7.9/lib/python2.7/site-packages/ruffus/task.py", line 541, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "/ifs/mirror/jenkins/CGATPipelines/CGATPipelines/pipeline_mapping.py", line 363, in buildReferenceGeneSet
    if ensembl_no >= 78:
UnboundLocalError: local variable 'ensembl_no' referenced before assignment

AndreasHeger commented 9 years ago

Hi @TomSmithCGAT , Please fix.

Also, do not work with ENSEMBL version numbers. Instead test if the attributes you need are present and if not fall back to something else. And respect the boundaries between pipelines - do not try to guess what another pipeline has done, but only work with the data exported.

(The error comes about because I use an annotation directory that follows a different naming convention).
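For readers unfamiliar with this failure mode, here is a minimal sketch of how such an error can arise. The code in pipeline_mapping.py is not shown in this thread, so the functions, the regular expression and the directory names below are hypothetical; only the variable name and the threshold of 78 come from the traceback.

```python
import re


def is_recent_unsafe(dirname):
    """Hypothetical reconstruction: the version is assigned only on a match."""
    match = re.search(r"ensembl(\d+)", dirname)
    if match:
        ensembl_no = int(match.group(1))
    # If the directory name did not match, ensembl_no was never bound and
    # the comparison below raises UnboundLocalError.
    return ensembl_no >= 78


def is_recent_safe(dirname):
    """Same check, but the no-match case is handled explicitly."""
    match = re.search(r"ensembl(\d+)", dirname)
    if match is None:
        return None  # version unknown; the caller must decide what to do
    return int(match.group(1)) >= 78
```

With a directory named along the lines of "annotations_custom", the first function fails in the way shown in the traceback, while the second returns None. This only removes the crash; it does not address the design concern about relying on version numbers at all.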

Best wishes, Andreas

TomSmithCGAT commented 9 years ago

@AndreasHeger Thanks for the advice. The issue is that the "source" column of the GTF file (column 2) no longer contains the expected values, e.g. "processed_transcript", but now contains the actual source of the annotation, e.g. "havana". The "processed_transcript" string is now stored in the attributes as the gene biotype, which is lost during the creation of the geneset_all.gtf.gz file.
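The advice to test for attributes rather than version numbers could look roughly like the sketch below. It is not the CGAT implementation: the function names are hypothetical, the attribute parser is deliberately naive (it does not handle semicolons inside quoted values), and it assumes the attribute key is gene_biotype, as mentioned later in this thread.

```python
def parse_attributes(field):
    """Parse a GTF attribute column (column 9) into a dict (naive parser)."""
    attributes = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def get_biotype(line):
    """Return the biotype of a GTF line without consulting a version number."""
    columns = line.rstrip("\n").split("\t")
    attributes = parse_attributes(columns[8])
    # Prefer the explicit attribute when it is present; otherwise fall back
    # to the source column (column 2), which held the biotype in older files.
    if "gene_biotype" in attributes:
        return attributes["gene_biotype"]
    return columns[1]
```

Given an old-style line with "processed_transcript" in column 2 and a new-style line with "havana" in column 2 plus a gene_biotype attribute, both calls would yield "processed_transcript".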

This can be rectified either in the annotations pipeline, so that all final GTFs are as expected regardless of the ENSEMBL version, or downstream, wherever the difference in the ENSEMBL GTF format causes a problem. It makes more sense to me to modify pipeline_annotations. What do you think?

AndreasHeger commented 9 years ago

Hi @TomSmithCGAT , the purpose of pipeline_annotations is to provide genomic annotations in a standardized format - so yes, dealing with the effects of different ENSEMBL versions needs to be implemented in pipeline_annotations. The correct way is not to patch, but to generalize the current implementation such that both old and new ENSEMBL can be accommodated.

Best wishes, Andreas

TomSmithCGAT commented 9 years ago

@AndreasHeger - OK, I'll try to make pipeline_annotations consistent between ENSEMBL versions.

AndreasHeger commented 9 years ago

Thanks! Happy to help.

TomSmithCGAT commented 9 years ago

Issue resolved by modifying the GTF.Entry class and gtf2gtf.py to retain the gene_biotype tag in the attributes of merged GTFs. Removed the patch from pipeline_mapping.py and PipelineMapping.py.
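As an illustration of what "retaining the tag" means when entries are merged, here is a standalone sketch. It does not use or reproduce the GTF.Entry class or gtf2gtf.py; the Entry tuple and merge_gene function are hypothetical stand-ins, and the coordinates are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed GTF record; not the CGAT GTF.Entry class.
Entry = namedtuple("Entry", "contig source feature start end strand attributes")


def merge_gene(entries):
    """Collapse the entries of one gene into a single span.

    The gene_biotype attribute is copied into the merged record if any
    input entry carries it, so downstream steps can still read it.
    """
    first = entries[0]
    merged_attributes = {"gene_id": first.attributes["gene_id"]}
    for entry in entries:
        if "gene_biotype" in entry.attributes:
            merged_attributes["gene_biotype"] = entry.attributes["gene_biotype"]
            break
    return Entry(
        first.contig,
        first.source,
        "gene",
        min(e.start for e in entries),
        max(e.end for e in entries),
        first.strand,
        merged_attributes,
    )
```

Merging two exons of the same gene then gives one record spanning both, with gene_biotype still present in its attributes.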

Changes were made on the following branches: TS-pipeline_annotations_constistency_ENSEMBL_version (CGATPipelines) and TS-Add_re-source_method_to_gff2gff (cgat).

AndreasHeger commented 9 years ago

Thanks, this has been fixed.