Closed AndreasHeger closed 9 years ago
Hi @TomSmithCGAT, please fix.
Also, do not rely on ENSEMBL version numbers. Instead, test whether the attributes you need are present and, if not, fall back to something else. And respect the boundaries between pipelines: do not try to guess what another pipeline has done, but work only with the data it exports.
(The error comes about because I use an annotation directory that follows a different naming convention.)
Best wishes, Andreas
@AndreasHeger Thanks for the advice. The issue is that the "source" column of the GTF file (column 2) no longer contains the expected values, e.g. "processed_transcript", but now contains the actual source of the annotation, e.g. "havana". The "processed_transcript" string is now stored in the attributes as the gene biotype, which is lost during the creation of the geneset_all.gtf.gz file.
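To illustrate the "test for the attribute, otherwise fall back" approach on this particular difference, here is a minimal sketch. It is not the CGAT `GTF.Entry` code: `parse_attributes` and `biotype_of` are hypothetical helper names, and it assumes tab-separated GTF lines with nine columns, with the `gene_biotype` attribute present in newer ENSEMBL files and the biotype held in the source column in older ones.

```python
import re

# Matches GTF attribute pairs such as: gene_id "ENSG1"; or exon_number 1;
_ATTRIBUTE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;?')


def parse_attributes(field):
    """Parse the ninth GTF column into a dict (hypothetical helper)."""
    attributes = {}
    for match in _ATTRIBUTE.finditer(field):
        key, quoted, bare = match.groups()
        attributes[key] = quoted if quoted is not None else bare
    return attributes


def biotype_of(gtf_line):
    """Return the biotype of a GTF line without checking any version number.

    Prefer the gene_biotype attribute when it is present; otherwise fall
    back to the source column (column 2), which is where older ENSEMBL
    files keep values such as "processed_transcript".
    """
    columns = gtf_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    attributes = parse_attributes(columns[8])
    if "gene_biotype" in attributes:
        return attributes["gene_biotype"]
    return columns[1]
```

With this, a line whose source column is "havana" but whose attributes carry `gene_biotype "processed_transcript"` and an older-style line whose source column is "processed_transcript" both yield the same value, so downstream code does not need to know which ENSEMBL version produced the file.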
This can either be rectified in the annotations pipeline, so that all final GTFs are as expected regardless of ENSEMBL version number, or rectified downstream wherever the difference in the ENSEMBL GTF causes a problem. It makes more sense to me to modify pipeline_annotations. What do you think?
Hi @TomSmithCGAT, the purpose of pipeline_annotations is to provide genomic annotations in a standardized format, so yes, dealing with the effects of different ENSEMBL versions needs to be implemented in pipeline_annotations. The correct way is not to patch, but to generalize the current implementation so that both old and new ENSEMBL versions can be accommodated.
Best wishes, Andreas
@AndreasHeger - OK, I'll try to make pipeline_annotations consistent between ENSEMBL versions.
Thanks! Happy to help.
Issue resolved by modifying the GTF.Entry class and gtf2gtf.py to retain the gene_biotype tag in the attributes of merged GTFs. Removed the patch from pipeline_mapping.py and PipelineMapping.py.
Changes made on the following branches: TS-pipeline_annotations_constistency_ENSEMBL_version (CGATPipelines) and TS-Add_re-source_method_to_gff2gff (cgat).
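For readers who want a picture of what "retain the gene_biotype tag in merged GTFs" means, a rough sketch follows. It does not reproduce the actual changes to `GTF.Entry` or gtf2gtf.py: the `Record` class, the `merge_gene` function, and the `keep` parameter are hypothetical names, and the merge is assumed to collapse all records of one gene into a single interval.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Simplified stand-in for a GTF entry (hypothetical, not GTF.Entry)."""
    contig: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    attributes: dict = field(default_factory=dict)


def merge_gene(records, keep=("gene_id", "gene_biotype")):
    """Collapse the records of one gene into a single spanning record.

    Attributes listed in `keep` are carried over when present, so that
    gene_biotype is not lost during merging. An attribute that is missing
    from every input record is simply absent from the merged record.
    """
    if not records:
        raise ValueError("no records to merge")
    merged_attributes = {}
    for key in keep:
        values = {r.attributes[key] for r in records if key in r.attributes}
        if len(values) > 1:
            raise ValueError("conflicting values for %s: %s" % (key, sorted(values)))
        if values:
            merged_attributes[key] = values.pop()
    first = records[0]
    return Record(
        contig=first.contig,
        source=first.source,
        feature="exon",
        start=min(r.start for r in records),
        end=max(r.end for r in records),
        strand=first.strand,
        attributes=merged_attributes,
    )
```

Carrying over only the attributes that are actually present keeps the same function usable on both older files (no gene_biotype attribute) and newer ones, in line with the advice earlier in this thread to generalize rather than patch.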
Thanks, this has been fixed.