Open hexylena opened 8 years ago
I would like to work on this. Where do I start?
@souravsingh the existing bed2gff tool is https://github.com/galaxyproject/galaxy/blob/dev/tools/filters/bed_to_gff_converter.py
This would involve:
Thanks for your interest! :D
@erasche Would it be alright if I make use of tools from BioPython for the conversion or do I need to modify the existing file?
Hi @souravsingh it would be easiest to get the contribution in if you didn't use any of the external dependencies, I'm afraid.
In case anyone is still interested, I modified the Galaxy script to convert BED files to GFF3 instead of GFF2. I updated the script to Python 3 and added argparse for the command-line interface.
https://gist.github.com/apcamargo/50edadc6ab13bfafd05eb4b887f8dd6d
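For readers who want the coordinate handling spelled out, here is a minimal sketch of a BED to GFF3 conversion. It is not the script from the gist above. The function names `bed_line_to_gff3` and `convert_bed_to_gff3`, the `bed2gff3` source label, and the `region` feature type are placeholders chosen for illustration. It assumes tab-separated BED, ignores BED12 block columns, and does not escape reserved characters in names.

```python
def bed_line_to_gff3(line, source="bed2gff3", feature_type="region"):
    """Convert one BED line to a GFF3 line; return None for non-feature lines.

    Assumptions (illustrative only): tab-separated input, BED12 block
    columns are ignored, and names contain no GFF3 reserved characters.
    """
    line = line.rstrip("\r\n")
    # Skip blank lines, comments, and UCSC track/browser lines.
    if not line or line.startswith(("#", "track ", "browser ")):
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    chrom = fields[0]
    # BED starts are 0-based, half-open; GFF3 is 1-based, inclusive.
    # So the start shifts by one and the end stays the same.
    start = int(fields[1]) + 1
    end = int(fields[2])
    name = fields[3] if len(fields) > 3 else ""
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    # Name (not ID) is used because BED names are not guaranteed unique.
    attributes = f"Name={name}" if name else "."
    return "\t".join(
        [chrom, source, feature_type, str(start), str(end),
         score, strand, ".", attributes]
    )


def convert_bed_to_gff3(bed_lines):
    """Yield GFF3 lines (without newlines) for an iterable of BED lines."""
    yield "##gff-version 3"
    for line in bed_lines:
        converted = bed_line_to_gff3(line)
        if converted is not None:
            yield converted


# Example: one BED6 record.
for out in convert_bed_to_gff3(["chr1\t999\t2000\tgeneA\t500\t+\n"]):
    print(out)
```

The off-by-one on the start column is the part that is easiest to get wrong; everything else is column shuffling.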
@apcamargo I think there would be interest for a pull request.
See also https://github.com/galaxyproject/galaxy/issues/7771
Having a BED > GFF3 converter would be great. Sorry if this is a naive question, but we plan to keep both conversion options, correct? We need both conversions available, or just GTF, for the reasons below.
Many tools do not consume GFF3 (only actual GTF content, sometimes given a "GFF" datatype). Some accept both GTF/GFF3. Tools can be unclear in the input datatype filter area and filter for just "gff" -- which could mean any of GFF, GTF (GFF2), or GFF3.
Main point: Any implicit conversion from bed-to-someGF* format would need to be smart enough to know what the tool is expecting, and use the right converter, or we will see a LOT of tool errors. Or the tool input select filters will need to be more explicit and never use just "GFF" but the more specific datatypes of "GTF" and/or "GFF3".
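To illustrate what "smart enough" could mean in practice, here is a rough heuristic sketch that guesses whether a generic "gff" dataset is really GTF or GFF3 by looking at the attribute column. This is not how Galaxy's datatype sniffers are implemented; the function name `guess_gff_flavor` and the regular expressions are assumptions made for this example. It relies on GFF3 attributes being written as `key=value` and GTF attributes as `key "value";`.

```python
import re

# GFF3 column 9 looks like: ID=gene1;Name=abc
GFF3_ATTR = re.compile(r"^[^\s=;]+=[^;]*(;|$)")
# GTF column 9 looks like: gene_id "g1"; transcript_id "t1";
GTF_ATTR = re.compile(r'^\S+ +("[^"]*"|\S+) *;')


def guess_gff_flavor(lines, max_features=100):
    """Return 'gff3', 'gtf', or 'gff' (undetermined) for an iterable of lines."""
    votes = {"gff3": 0, "gtf": 0}
    checked = 0
    for line in lines:
        line = line.rstrip("\r\n")
        # An explicit version pragma settles the question for GFF3.
        if line.startswith("##gff-version"):
            parts = line.split()
            if len(parts) > 1 and parts[1].startswith("3"):
                return "gff3"
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        attrs = fields[8].strip()
        if GFF3_ATTR.match(attrs):
            votes["gff3"] += 1
        elif GTF_ATTR.match(attrs):
            votes["gtf"] += 1
        checked += 1
        if checked >= max_features:
            break
    if votes["gff3"] > votes["gtf"]:
        return "gff3"
    if votes["gtf"] > votes["gff3"]:
        return "gtf"
    # No evidence either way: stay with the generic datatype.
    return "gff"
```

A check like this could only ever be a fallback. More explicit datatype filters on the tool inputs would still be the cleaner fix.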
We would probably avoid other user-problems by having more specific datatype filters for these types -- GTF and GFF3 are not the same. Plus, if a GTF is uploaded that includes a header (somewhat common in public data), the datatype "gff" is auto-detected and assigned. Tools then consume that input and fail because of the header lines. Users need to remove them first in a distinct step (Select tool). It gets complicated...
To consider: If a user could no longer use a dataset with the generic "gff" datatype assigned, they would be alerted that there is a problem, before running a tool that doesn't know how to parse the header out. A small learning curve and potentially some workflow changes.
Alternative: There was discussion of whether Upload should strip GTF header lines (possibly splitting the header off into a distinct dataset), or whether all tools should be modified to strip out the GTF header lines at runtime (better for the user). Both have positive and negative impacts.
But that ^^ won't fix GTF/GFF3 input mixups, especially if an implicit conversion is not converting to the correct format behind the scenes.
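As a sketch of the header-splitting idea, the snippet below separates the leading header block from the feature lines. It assumes "header" means the leading run of lines that start with `#` (plus any blank lines among them), which is what is commonly seen in public GTF files. The function name `split_gtf_header` is hypothetical, and this is not existing Upload behaviour.

```python
def split_gtf_header(lines):
    """Split an iterable of GTF lines into (header_lines, body_lines).

    Only the leading block of '#' (or blank) lines counts as header;
    comment lines that appear after the first feature stay in the body.
    """
    header = []
    body = []
    in_header = True
    for line in lines:
        if in_header and (line.startswith("#") or not line.strip()):
            header.append(line)
        else:
            in_header = False
            body.append(line)
    return header, body


# Example: two header lines followed by one feature line.
example = [
    "#!genome-build EXAMPLE\n",
    "#!genome-version EXAMPLE\n",
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g1";\n',
]
header, body = split_gtf_header(example)
print(len(header), len(body))
```

Returning the header separately keeps the option of storing it as its own dataset, so nothing is lost if a downstream tool does want it.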
Some more gory details in here: https://github.com/galaxyproject/usegalaxy-playbook/issues/45
cc @jmchilton
https://biostar.usegalaxy.org/p/2083/ apparently this is the feature request which spawned our IRC channel ... 5 years ago (thanks @natefoo). It also hasn't been solved and I recently tried to feed the output of this tool to a gff3 tool and realised what was going on. A little googling later revealed that nothing has been done yet.