Open Averstic opened 2 years ago
Would really be interested in this feature in JB2 as it is an essential way of sharing data with non-computational scientists.
thanks for adding interest in this @Averstic
what is the main feature that you generally use this for? is it the GFF export of a region?
Yes indeed, the GFF export of a region would be of interest, along with the possibility to extract the reference sequence of the current view. Together these would make it possible to quickly extract sequence and annotation for import into other tools.
@cmdcolin Just interested, would you consider this feature as a hard feature to create?
@Averstic it could be somewhat challenging. there are two general approaches which are not necessarily mutually exclusive
1) exporting chunks of the original source data file, which is more true to the original data but might not work well with things like REST APIs
2) making a general data export system where any feature can be translated to some data format. This system is basically how jbrowse 1 does it, but there are odd corner cases in this that can be difficult to handle properly e.g. choosing what the appropriate file formats are for a given track, generating accurate serializations of arbitrary data, integrating with our plugin system etc. this general system can result in data conversions like vcf to gff, bam to gff, bigwig to bed or other things like that which could be good or bad depending on your point of view:)
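As a rough illustration of approach 2, the sketch below registers one serializer per output format, independent of the source file type. Everything in it (`ExportFeature`, `exporters`, `exportFeatures`, the `export` source column) is a hypothetical name for illustration and not the JBrowse API; it assumes features are held in 0-based half-open coordinates and it ignores attribute escaping.

```typescript
// Hypothetical minimal feature shape; real features carry far more data.
interface ExportFeature {
  refName: string
  start: number // assumed 0-based, half-open
  end: number
  strand?: 1 | -1
  type: string
  name?: string
}

type Exporter = (features: ExportFeature[]) => string

// Hypothetical registry: one serializer per output format. Because it only
// sees generic features, a BAM or VCF track could end up exported as GFF3.
const exporters: Record<string, Exporter> = {
  bed: features =>
    features
      .map(f => [f.refName, f.start, f.end, f.name ?? '.'].join('\t'))
      .join('\n'),
  gff3: features =>
    features
      .map(f =>
        [
          f.refName,
          'export', // placeholder for the GFF3 source column
          f.type,
          f.start + 1, // GFF3 is 1-based, inclusive
          f.end,
          '.',
          f.strand === -1 ? '-' : f.strand === 1 ? '+' : '.',
          '.',
          f.name ? `Name=${f.name}` : '.',
        ].join('\t'),
      )
      .join('\n'),
}

function exportFeatures(format: string, features: ExportFeature[]): string {
  const exporter = exporters[format]
  if (!exporter) {
    throw new Error(`no exporter registered for format "${format}"`)
  }
  return exporter(features)
}
```

The corner cases mentioned above show up quickly in a design like this: nothing here decides which formats make sense for a given track, and nested features or arbitrary attributes are not handled.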
Thank you for this insight.
On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.
On option 2. what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?
> On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.
what type of workflow would want this transformation? just a note that jbrowse 1 did not do transformations like this.
> On option 2. what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?
I think it is possible this could make it into a future release. it wasn't intentional to not carry it over, just a limitation of dev resources
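For reference, the 'local' coordinate transformation asked about above amounts to subtracting the region offset. A minimal sketch, assuming 1-based inclusive coordinates; `Interval` and `toLocal` are hypothetical names, and clipping of features that extend past the region is left out:

```typescript
// 1-based, inclusive interval (assumed convention for this sketch)
interface Interval {
  start: number
  end: number
}

// Shift a genome-wide interval so that the first base of the exported
// region becomes position 1.
function toLocal(feature: Interval, regionStart: number): Interval {
  const offset = regionStart - 1
  return { start: feature.start - offset, end: feature.end - offset }
}

// Using the region from the LOCUS line quoted later in this thread
// (CM008514.1:14313335..14315164):
const local = toLocal({ start: 14313335, end: 14315164 }, 14313335)
console.log(`${local.start}..${local.end}`)
```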
With regard to the work mentioned in https://github.com/GMOD/jbrowse-components/pull/3439, this is the feedback from the WormBase user:
Hi, Scott. Thank you and the jbrowse2 developers so much for developing this prototype so quickly. To test the prototype, I downloaded region I:3250911..3307532 from both jbrowse2 and gbrowse in C. elegans. The jbrowse2 GenBank output is close, but it still does not work.
I make some suggestions as follows:
Items 1 and 2 above are pretty easy, and item 4 could be differences in annotation release but might be something. Item three is a little more tricky; I assume the CDS features need to be stashed when iterating through the features and I don't really know how to do that in JB/React. I may take a crack at 1 and 2 though.
that is all very good feedback. (3), the CDS join, is attempted in the save_track_data branch but not sure if he received the same
see packages/core/pluggableElementTypes/models/components/genbank.ts
welcome to try out further work on that branch
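To make the CDS join from item (3) concrete, here is one way the location string could be assembled once the CDS segments of a transcript have been collected. This is a sketch only, not the code in the save_track_data branch; `Segment` and `genbankLocation` are hypothetical names. It reproduces the descending segment order seen in the output quoted in this thread for minus-strand features. The INSDC feature table examples list segments in ascending order inside `complement(join(...))`, so whether to reverse is an open choice.

```typescript
// 1-based, inclusive CDS segment (assumed convention for this sketch)
interface Segment {
  start: number
  end: number
}

function genbankLocation(segments: Segment[], strand: 1 | -1): string {
  if (segments.length === 0) {
    throw new Error('at least one segment is required')
  }
  // Sort ascending by start, then reverse for the minus strand to match
  // the ordering seen in the output quoted in this thread.
  const sorted = [...segments].sort((a, b) => a.start - b.start)
  if (strand === -1) {
    sorted.reverse()
  }
  const parts = sorted.map(s => `${s.start}..${s.end}`)
  const joined =
    parts.length > 1 ? `join(${parts.join(',')})` : parts.join(',')
  return strand === -1 ? `complement(${joined})` : joined
}
```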
the wormbase user updated his comment to say that he was using the wrong track which is why he didn’t get the results he expected and now all is well; his only additional comment would be that it would be nice for the downloaded file name to include the location in the name to prevent name collisions.
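A small sketch of the file naming suggestion, assuming the export code knows the reference name and the coordinates of the region. `exportFileName` and the `gb` extension are hypothetical choices for illustration:

```typescript
// Build a download name that embeds the exported location, so that two
// exports of different regions do not collide on disk.
function exportFileName(
  refName: string,
  start: number,
  end: number,
  extension: string,
): string {
  // Replace characters that are awkward in file names (":" for example)
  const safeRef = refName.replace(/[^A-Za-z0-9._-]/g, '_')
  return `${safeRef}_${start}-${end}.${extension}`
}

// Region used by the WormBase user earlier in this thread
console.log(exportFileName('I', 3250911, 3307532, 'gb'))
```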
New update from the WormBase user where he found what looks like a bug. From https://community.alliancegenome.org/t/genbank-format-downloading-from-jbrowse1-2/6772/11:
user:
     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"
I found a bug. The CDS feature is not recognized in the above GenBank file. This error may originate from the long multi-line info qualifier in the mRNA feature.
Me:
Interesting, if you manually take out the carriage returns in the “info” does it then work? I’m trying to figure out what we need to do generally, since that info section can frequently be quite long.
User:
     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"
The above format worked.
@cmdcolin I don't think changes to implement this ^^^ made it into the last PR (where the "method" lines are spaced over to the rest of the text); I took a look at the code diffs for that branch and didn't see the obvious place to make a change, so I'm going to have to ask you to do it too.
To reproduce:
The resulting genbank output looks like:
LOCUS       CM008514.1:14313335..14315164 1830 bp DNA linear UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
        1
but it needs to look like:
LOCUS       CM008514.1:14313335..14315164 1830 bp DNA linear UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
        1
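One possible way to get there, sketched under assumptions: split the qualifier value on embedded newlines and indent every continuation line to the qualifier column (21 spaces in GenBank flat files). `formatQualifier` is a hypothetical helper, not a function from the branch. It does not wrap long lines at 80 columns, and it doubles embedded quotes, which is how the INSDC feature table represents a quote inside a quoted value.

```typescript
// GenBank qualifiers start at column 22, i.e. after 21 spaces.
const QUALIFIER_INDENT = ' '.repeat(21)

function formatQualifier(key: string, value: string): string {
  // Split on embedded newlines and drop empty or whitespace-only lines
  const lines = value
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0)
  if (lines.length === 0) {
    return `${QUALIFIER_INDENT}/${key}=""`
  }
  // A double quote inside a quoted value is written as two double quotes
  const escaped = lines.map(line => line.replace(/"/g, '""'))
  return escaped
    .map((line, i) => {
      const prefix = i === 0 ? `/${key}="` : ''
      const suffix = i === escaped.length - 1 ? '"' : ''
      // Every line, including continuations, gets the qualifier indent
      return `${QUALIFIER_INDENT}${prefix}${line}${suffix}`
    })
    .join('\n')
}
```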
potentially the issue highlighted above points to a need to re-urlencode things before writing out to gff/genbank, but it may also be useful for the upstream wormbase-pipeline to not have newlines
So given that the problem I cited above is really with the GFF (and will hopefully be fixed with the next WB release), do you feel comfortable pushing this into main, or do you want to add the re-encoding first? I don't have a strong opinion since it shouldn't be a problem for "well behaved" gff.
ETA: Oh, but if the GFF that gets dumped out isn't getting URI encoded, that would be kind of a problem. I guess that should be dealt with first.
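For the GFF side, the re-encoding discussed here could look roughly like the sketch below. It percent-encodes only the characters the GFF3 specification reserves in the attributes column (control characters including tab and newline, plus `%`, `;`, `=`, `&` and `,`). `escapeGff3Attribute` is a hypothetical helper, not existing JBrowse code.

```typescript
// Percent-encode the characters that GFF3 reserves in column 9 values.
// Everything else is left as-is to keep the output readable.
function escapeGff3Attribute(value: string): string {
  return value.replace(/[\x00-\x1f\x7f%;=&,]/g, ch => {
    const code = ch.charCodeAt(0)
    const hex = code.toString(16).toUpperCase()
    return '%' + (code < 16 ? '0' : '') + hex
  })
}

// An embedded newline, as in the multi-line "info" attribute above,
// becomes %0A instead of breaking the record across lines.
console.log(escapeGff3Attribute('method:InterPro a\nmethod:InterPro b'))
```

Since the escapes are ordinary percent-encoding, a parser can recover the original value with `decodeURIComponent`.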
I think that I would like this PR to improve architecturally and in code quality before it is merged to main. it is a good proof of concept but it may help to let it evolve a little bit before merge. I can keep this branch updated with main so you can keep using it
also, if possible, keep the discussion of the particular PR on the PR page
Discussed in https://github.com/GMOD/jbrowse-components/discussions/2810