htseq_counts cannot accept gtf files with 'gene_id' field

ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.

https://ewels.github.io/clusterflow/

GNU General Public License v3.0

97 stars 27 forks source link

htseq_counts cannot accept gtf files with 'gene_id' field #39

Closed stu2 closed 9 years ago

stu2 commented 9 years ago

Hi Phil,

I've been trying to use htseq_counts.cfmod with the mouse genome .gtf file from mm10. In this gtf file the gene is marked by gene_id not ID. However, ID is hardcoded into the command that htseq_counts produces (line 103: -i 'ID' ), so it chokes. Could this be made into an argument for flexibility?

Cheers

edit: storing the ID tag in the clusterflow genome info for that genome/gtf set might be even better.

ewels commented 9 years ago

Hey,

Yeah of course. Whilst I agree that it would be nice to store the tag with the genome info as it will always be the same for that reference GTF file, I'm a bit hesitant to start adding extra information to that file as I think it could get very complicated very quickly.

Parameters would be the obvious choice, defaulting to ID if not provided. Or better still, is there any way that the module can scan the gtf file and look for a hierarchy of tags? eg. try to find gene_id if it's not there, then look for ID (and so on, with a list of potential tags). Or alternatively have a regex that looks for ensembl ENSG\d+ gene ID and pull out the associated identifier?

What do you think? Maybe a combination of all of these things.

Phil

stu2 commented 9 years ago

Adding an ID parameter could be a rapid fix. The hierachy of tags idea would be the nicest option. But for that, I think it could aid extensibility to put the code for guessing the gtf ID tag info outside the htseq_counts module, like in the helper mod or something.

ewels commented 9 years ago

Is the gtf ID tag needed for any other modules?

stu2 commented 9 years ago

Not that I know of. I haven't tested Tophat but I think it must have it's own internal parser.

ewels commented 9 years ago

Cool, I'll have a stab at writing it. I think I'll keep it in the htseq-counts module for now, to keep the Helpers file clean. It's easy enough to move it out into Helpers if another module needs the code (this is how just about every function in there started out).

ewels commented 9 years ago

Hi @stu2 - that last commit should have addressed the problem hopefully.

I've set it so that the tag with the parameter id_tag - if this isn't present, it looks for gene_id in the file, then ID. If in doubt, it falls back to ID. Hope this is ok - let me know if there are more tags that it should look for.

As you can probably see, I did this amongst a big batch of other unrelated work. I'll be testing and merging this soon.

Phil