galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Bed to gff3 #1270

Open hexylena opened 8 years ago

hexylena commented 8 years ago

https://biostar.usegalaxy.org/p/2083/ Apparently this is the feature request that spawned our IRC channel ... 5 years ago (thanks @natefoo). It still hasn't been solved: I recently tried to feed the output of the BED-to-GFF converter to a GFF3 tool and realised what was going on. A little googling later revealed that nothing has been done yet.

souravsingh commented 8 years ago

I would like to work on this. Where do I start?

hexylena commented 8 years ago

@souravsingh the existing bed2gff tool is https://github.com/galaxyproject/galaxy/blob/dev/tools/filters/bed_to_gff_converter.py

This would involve:

Thanks for your interest! :D

souravsingh commented 8 years ago

@erasche Would it be alright if I made use of tools from Biopython for the conversion, or do I need to modify the existing file?

hexylena commented 8 years ago

Hi @souravsingh, it would be easiest to get the contribution merged if you didn't use any external dependencies, I'm afraid.

apcamargo commented 5 years ago

In case anyone is still interested, I modified the Galaxy script to convert BED files to GFF3 instead of GFF2. I updated the script to Python 3 and added argparse for the command-line interface.

https://gist.github.com/apcamargo/50edadc6ab13bfafd05eb4b887f8dd6d
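For readers who want a feel for what such a conversion involves, here is a minimal sketch using only the standard library. It is not the code from the gist or from Galaxy's converter. The function names, the `bed2gff3` source label and the `region` feature type are assumptions chosen for illustration. It only handles the first six BED columns and writes the BED name as a `Name` attribute (not `ID`, since BED names are not guaranteed to be unique).

```python
def escape_gff3_value(value):
    """Percent-encode the characters that are reserved in GFF3 attribute values."""
    # "%" is handled first so already-escaped text is not escaped twice.
    for ch in "%;=&,":
        value = value.replace(ch, f"%{ord(ch):02X}")
    return value


def bed_line_to_gff3(line, source="bed2gff3", feature_type="region"):
    """Convert one BED line (3 to 6 columns) into one GFF3 feature line.

    `source` and `feature_type` are illustrative defaults, not Galaxy's.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    if len(fields) > 3 and fields[3]:
        attributes = "Name=" + escape_gff3_value(fields[3])
    else:
        attributes = "."
    # BED starts are 0-based; GFF3 starts are 1-based. The end stays the same.
    return "\t".join(
        [chrom, source, feature_type, str(start + 1), str(end),
         score, strand, ".", attributes]
    )


def bed_to_gff3(lines):
    """Yield a GFF3 version pragma followed by one feature line per BED record."""
    yield "##gff-version 3"
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.split(None, 1)[0] in ("track", "browser"):
            continue  # UCSC track/browser lines carry no features
        yield bed_line_to_gff3(line)
```

Usage would be something like `print("\n".join(bed_to_gff3(open("input.bed"))))`, where `input.bed` is a placeholder path. The key detail is the coordinate shift: a BED interval `0 100` becomes GFF3 `1 100`.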

nsoranzo commented 5 years ago

@apcamargo I think there would be interest for a pull request.

See also https://github.com/galaxyproject/galaxy/issues/7771

jennaj commented 5 years ago

Having a BED > GFF3 converter would be great. Sorry if this is a naive question, but we plan to keep both conversion options, correct? We need both conversions available, or just GTF, for the reasons below.

Many tools do not consume GFF3 (only actual GTF content, sometimes given a "GFF" datatype). Some accept both GTF/GFF3. Tools can be unclear in the input datatype filter area and filter for just "gff" -- which could mean any of GFF, GTF (GFF2), or GFF3.
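To make the difference concrete, here is a rough sketch of how the flavours could be told apart from the content alone. This is not Galaxy's sniffer code. The function name `guess_gff_flavor` and the returned labels are hypothetical, and the heuristic only looks at the version pragma and the attribute column of the first feature line (GTF uses `key "value";` pairs, GFF3 uses `key=value` pairs).

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "t1";
GTF_ATTR = re.compile(r'^\s*\w+ "[^"]*"\s*;')
# GFF3 attributes look like: ID=gene1;Name=x
GFF3_ATTR = re.compile(r'^\s*[^=;\s]+=[^;]*')


def guess_gff_flavor(lines):
    """Return "gff3", "gtf", "gff" or "unknown" for an iterable of lines.

    A heuristic sketch: it trusts a "##gff-version 3" pragma if present,
    otherwise it inspects column 9 of the first feature line only.
    """
    for line in lines:
        if line.startswith("##gff-version"):
            parts = line.split()
            if len(parts) > 1 and parts[1].startswith("3"):
                return "gff3"
            continue
        if line.startswith("#") or not line.strip():
            continue  # header, comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            return "unknown"
        if GTF_ATTR.match(fields[8]):
            return "gtf"
        if GFF3_ATTR.match(fields[8]):
            return "gff3"
        return "gff"  # nine columns, but attributes match neither style
    return "unknown"
```

A check like this illustrates why a bare "gff" filter is ambiguous: all three flavours share the same nine-column layout and differ mainly in the last column.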

Main point: Any implicit conversion from bed-to-someGF* format would need to be smart enough to know what the tool is expecting, and use the right converter, or we will see a LOT of tool errors. Or the tool input select filters will need to be more explicit and never use just "GFF" but the more specific datatypes of "GTF" and/or "GFF3".

We would probably avoid other user-problems by having more specific datatype filters for these types -- GTF and GFF3 are not the same. Plus, if a GTF is uploaded that includes a header (somewhat common in public data), the datatype "gff" is auto-detected and assigned. Tools then consume that input and fail because of the header lines. Users need to remove them first in a distinct step (Select tool). It gets complicated...

To consider: If a user could no longer use a dataset with the generic "gff" datatype assigned, they would be alerted that there is a problem, before running a tool that doesn't know how to parse the header out. A small learning curve and potentially some workflow changes.

Alternative: It was discussed whether Upload should strip GTF header lines (possibly splitting the header off into a distinct dataset), or whether all tools should be modified to strip out the GTF header lines at runtime (better for the user). Both have pros and cons.
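As an illustration of the "split the header off" idea, here is a minimal sketch. It is not how Upload or any Galaxy tool currently behaves. The function name `split_gtf_header` is hypothetical, and it assumes that every line starting with `#` is header or comment material.

```python
def split_gtf_header(lines):
    """Separate '#'-prefixed header/comment lines from feature lines.

    Returns (header_lines, feature_lines) as two lists, preserving order.
    Assumption: any line starting with '#' is a header or comment.
    """
    header, features = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        else:
            features.append(line)
    return header, features
```

With something like this, the feature lines could be kept as the "gtf" dataset while the header lines go to a separate dataset, or are simply dropped at tool runtime.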

But that ^^ won't fix GTF/GFF3 input mixups, especially if an implicit conversion is not converting to the correct format "behind the scenes".

Some more gory details in here: https://github.com/galaxyproject/usegalaxy-playbook/issues/45

cc @jmchilton