GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

Export track data like in the Jbrowse 1 #3094

Open Averstic opened 2 years ago

Averstic commented 2 years ago

Discussed in https://github.com/GMOD/jbrowse-components/discussions/2810

Originally posted by **Marie-Lahaye** March 15, 2022

Hi, I was wondering if there was a way to export track data, like in JBrowse 1? I remember that we could export track data for the region that we were visualizing, for example exporting genes in GFF3 format from a specific region:

![image](https://user-images.githubusercontent.com/63525627/158348077-d54bf5e7-8ac0-4371-9887-de99e590f17a.png)

Is there a similar feature in JBrowse 2? Thanks for any answer you have for me!

Marie

Averstic commented 2 years ago

Would really be interested in this feature in JB2 as it is an essential way of sharing data with non-computational scientists.

cmdcolin commented 2 years ago

thanks for adding interest in this @Averstic

what is the main feature that you generally use this for? is it the GFF export of a region?

Averstic commented 2 years ago

Yes indeed, the GFF export of a region would be of interest, along with the possibility to extract the reference sequence of the current view, in order to quickly extract sequence and annotation for import into other tools.

Averstic commented 2 years ago

@cmdcolin Just interested, would you consider this feature as a hard feature to create?

cmdcolin commented 2 years ago

@Averstic it could be somewhat challenging. there are two general approaches which are not necessarily mutually exclusive

1) exporting chunks of the original source data file, which is more true to the original data but might not work well with things like REST APIs

2) making a general data export system where any feature can be translated to some data format. This system is basically how jbrowse 1 does it, but there are odd corner cases in this that can be difficult to handle properly e.g. choosing what the appropriate file formats are for a given track, generating accurate serializations of arbitrary data, integrating with our plugin system etc. this general system can result in data conversions like vcf to gff, bam to gff, bigwig to bed or other things like that which could be good or bad depending on your point of view:)
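A minimal sketch of what option 2 could look like: a registry that maps a format name to a function turning features into text. The `SimpleFeature` shape, the `serializers` registry and `exportFeatures` are hypothetical names for illustration only, not the JBrowse 2 API, and the 0-based half-open input coordinates are an assumption of this sketch.

```typescript
// Hypothetical minimal feature shape; JBrowse 2's real Feature interface differs.
interface SimpleFeature {
  refName: string
  start: number // assumed 0-based, half-open
  end: number
  type: string
  strand: 1 | -1 | 0
  id: string
}

type Serializer = (features: SimpleFeature[]) => string

function strandChar(strand: number): string {
  return strand === 1 ? '+' : strand === -1 ? '-' : '.'
}

// GFF3 is 1-based and closed, so the start is shifted by one.
const toGff3: Serializer = features =>
  ['##gff-version 3']
    .concat(
      features.map(f =>
        [f.refName, '.', f.type, f.start + 1, f.end, '.', strandChar(f.strand), '.', `ID=${f.id}`].join('\t'),
      ),
    )
    .join('\n')

// BED is 0-based and half-open, so coordinates pass through unchanged.
const toBed: Serializer = features =>
  features
    .map(f => [f.refName, f.start, f.end, f.id, 0, strandChar(f.strand)].join('\t'))
    .join('\n')

// Hypothetical registry: a plugin could add an entry for its own format.
const serializers: { [format: string]: Serializer } = { gff3: toGff3, bed: toBed }

function exportFeatures(features: SimpleFeature[], format: string): string {
  const serialize = serializers[format]
  if (!serialize) {
    throw new Error(`no serializer registered for format "${format}"`)
  }
  return serialize(features)
}
```

Because every serializer accepts the same feature shape, any track can be written to any registered format, which is where conversions such as vcf to gff come from.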

Averstic commented 2 years ago

Thank you for this insight.

On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.

On option 2, what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?
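The 'local' coordinate recalculation mentioned for option 1 is a simple offset. A minimal sketch, assuming 1-based inclusive coordinates on both sides (as in GFF3 and GenBank); `toLocalCoords` is a hypothetical helper name:

```typescript
// Convert a 1-based, inclusive genome-wide interval into coordinates
// relative to the exported region (also 1-based, inclusive).
// `regionStart` is the genome-wide position of the region's first base.
function toLocalCoords(
  start: number,
  end: number,
  regionStart: number,
): { start: number; end: number } {
  return {
    start: start - regionStart + 1,
    end: end - regionStart + 1,
  }
}
```

For example, a feature spanning 14313335..14315164 exported from a region that starts at 14313335 becomes 1..1830.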

cmdcolin commented 2 years ago

> On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.

what type of workflow would want this transformation? just a note that jbrowse 1 did not do transformations like this.

> On option 2, what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?

I think it is possible this could make it into a future release. it wasn't intentional to not carry it over, just a limitation of dev resources

scottcain commented 1 year ago

With regard to the work mentioned in https://github.com/GMOD/jbrowse-components/pull/3439, this is the feedback from the WormBase user:

Hi, Scott. Thank you and the jbrowse2 developer so much for developing this prototype so quickly. To test this prototype, I downloaded region I:3250911..3307532 in C. elegans from both jbrowse2 and gbrowse. The jbrowse2 genbank file is close, but it still does not work.

I make some suggestions as follows:

  1. To make SnapGene recognize this GenBank file, at least 5 spaces are required between the feature key and the coordinates. `protein_coding_p1..4585` should be changed to `protein_coding_p     1..4585`.
  2. It is better to use only GenBank feature keys, such as gene, CDS, exon, ncRNA, rather than protein_coding_p. The detailed "Standard Feature" explanation can be found at https://www.insdc.org/submitting-standards/feature-table/#7.2. In this link, Appendix II contains descriptions of all feature keys.
  3. The most important function we rely on in GenBank is the joined coordinates for the `CDS` feature key, like `CDS join(16192..16313,16362..16851,16900..17084,17142..17409,17491..17889)`. All exons in the coding region of the gene are treated as one feature, not multiple separate features. That way, we can translate this CDS directly with SnapGene and view the amino acid sequence there.
  4. I found inconsistent coordinates between jbrowse2 and gbrowse in the region I:3250911..3307532 that I downloaded. In the jbrowse2 GenBank format, the coordinate of W01B11.2 is protein_coding_p 36569..42217. But in gbrowse, the coordinate of W01B11.2 is CDS join(36470..36612,36774..36843,37398..37623,37675..38141,38922..39602,40358..40464,40584..40768,40817..40915,40996..41100,41492..41599,41725..41855,41899..42004,42075..42217). The start coordinate differs between the two versions: 36569 for jbrowse2 and 36470 for gbrowse.

scottcain commented 1 year ago

Items 1 and 2 above are pretty easy, and item 4 could be differences in annotation release but might be something. Item three is a little more tricky; I assume the CDS features need to be stashed when iterating through the features and I don't really know how to do that in JB/React. I may take a crack at 1 and 2 though.
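A minimal sketch of items 1 and 3, independent of React: collect the CDS segments of one transcript first, then emit a single line. It assumes the INSDC feature table layout (feature key starting in column 6, location in column 22) and 1-based inclusive segments. `featureLine`, `cdsLocation` and the truncation of over-long keys are hypothetical choices for this sketch, not code from the save_track_data branch.

```typescript
interface Segment {
  start: number // 1-based, inclusive
  end: number
}

function padRight(text: string, width: number): string {
  let out = text
  while (out.length < width) {
    out += ' '
  }
  return out
}

// Five leading spaces, then the key padded so the location starts in column 22.
// Keys longer than 15 characters are truncated here (an assumption of this
// sketch) so that at least one space always separates key and location.
function featureLine(key: string, location: string): string {
  return '     ' + padRight(key.slice(0, 15), 16) + location
}

// Build one location string from all CDS segments of a transcript.
// Segments are sorted ascending; minus-strand features are wrapped in complement().
function cdsLocation(segments: Segment[], strand: 1 | -1): string {
  const sorted = segments.slice().sort((a, b) => a.start - b.start)
  const inner = sorted.map(s => `${s.start}..${s.end}`).join(',')
  const joined = sorted.length > 1 ? `join(${inner})` : inner
  return strand === -1 ? `complement(${joined})` : joined
}
```

The "stashing" then amounts to gathering the CDS subfeatures into an array while iterating and calling `cdsLocation` once per transcript.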

cmdcolin commented 1 year ago

that is all very good feedback. (3), the CDS join, is attempted in the save_track_data branch but not sure if he received the same

see packages/core/pluggableElementTypes/models/components/genbank.ts

https://github.com/GMOD/jbrowse-components/pull/3439/files#diff-2dd5f778cfc0e380e2b331d00f05c827002537cd874f7475f4ab4f44a28ad1cdR78-R91

welcome to try out further work on that branch

scottcain commented 1 year ago

the wormbase user updated his comment to say that he was using the wrong track which is why he didn’t get the results he expected and now all is well; his only additional comment would be that it would be nice for the downloaded file name to include the location in the name to prevent name collisions.
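One way to include the location in the downloaded file name, as a sketch; `exportFileName` and the `gbk` extension in the example are hypothetical:

```typescript
// Build a download name such as "I_3250911-3307532.gbk" so that exports
// of different regions do not overwrite each other.
function exportFileName(
  refName: string,
  start: number,
  end: number,
  extension: string,
): string {
  // Replace characters that are awkward in file names on common platforms.
  const safeRef = refName.replace(/[^A-Za-z0-9._-]/g, '_')
  return `${safeRef}_${start}-${end}.${extension}`
}
```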

scottcain commented 1 year ago

New update from the WormBase user where he found what looks like a bug. From https://community.alliancegenome.org/t/genbank-format-downloading-from-jbrowse1-2/6772/11:

user:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

I found a bug. The CDS feature is not recognized in the above GenBank file. This error may originate from the long multi-line info qualifier in the mRNA feature.

Me:

Interesting, if you manually take out the carriage returns in the “info” does it then work? I’m trying to figure out what we need to do generally, since that info section can frequently be quite long.

User:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

The above format worked.

scottcain commented 1 year ago

@cmdcolin I don't think changes to implement this ^^^ made it into the last PR (where the "method" lines are spaced over to the rest of the text); I took a look at the code diffs for that branch and didn't see the obvious place to make a change, so I'm going to have to ask you to do it too.
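A minimal sketch of the requested change, assuming qualifiers start in column 22 (21 leading spaces, as in the output above): every line of a multi-line value gets the same indent. `formatQualifier` is a hypothetical helper, not a function from the branch.

```typescript
// 21 spaces, so that qualifier text starts in column 22.
const QUALIFIER_INDENT = new Array(22).join(' ')

// Format one qualifier, indenting continuation lines of multi-line values.
function formatQualifier(name: string, value: string): string {
  // Embedded double quotes are doubled, as the INSDC feature table requires.
  const escaped = value.replace(/"/g, '""')
  return `/${name}="${escaped}"`
    .split(/\r?\n/)
    .map(line => QUALIFIER_INDENT + line.replace(/^\s+/, ''))
    .join('\n')
}
```

With this, a value containing newlines produces continuation lines aligned under the `/info=` line instead of starting in column 1.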

scottcain commented 1 year ago

To reproduce:

  1. Go to https://s3.amazonaws.com/agrjbrowse/test/save-track-data/index.html?session=share-vsTgjNX2Oi&password=5qEF3

  2. The resulting genbank output looks like:

    LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
    FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
    method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
    method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
    method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
    method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
    ORIGIN
    1 

    but it needs to look like:

LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
    1 

cmdcolin commented 1 year ago

potentially the issue highlighted above points to a need to re-urlencode things before writing out to gff/genbank, but it may also be useful for the upstream wormbase-pipeline to not emit newlines

scottcain commented 1 year ago

So given that the problem I cited above is really with the GFF (and will hopefully be fixed with the next WB release), do you feel comfortable pushing this into main, or do you want to add the re-encoding first? I don't have a strong opinion since it shouldn't be a problem for "well behaved" gff.

ETA: Oh, but if the GFF that gets dumped out isn't getting URI encoded, that would be kind of a problem. I guess that should be dealt with first.
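A minimal sketch of the re-encoding step for GFF3 column 9, assuming the escaping rules of the GFF3 specification (percent-encode control characters, `%`, and the reserved `;`, `=`, `&`, `,`). `encodeGff3Value` and `formatGff3Attributes` are hypothetical names, not functions from the branch.

```typescript
// Percent-encode only the characters that GFF3 reserves in attribute values.
// Spaces and other printable characters are left as-is.
function encodeGff3Value(value: string): string {
  return value.replace(/[\x00-\x1f\x7f%;=&,]/g, ch => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase()
    return '%' + (hex.length < 2 ? '0' + hex : hex)
  })
}

// Serialize a tag/value map into a column 9 attribute string.
function formatGff3Attributes(attributes: { [tag: string]: string }): string {
  return Object.keys(attributes)
    .map(tag => `${encodeGff3Value(tag)}=${encodeGff3Value(attributes[tag])}`)
    .join(';')
}
```

A newline inside a value is written as `%0A`, so each feature stays on one line when the file is dumped.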

cmdcolin commented 1 year ago

I think that I would like this PR to improve architecturally and code quality wise before merge to main. it is a good proof of concept but may help to evolve a little bit before merge. I can keep this branch updated with main so you can keep using it

cmdcolin commented 1 year ago

also, if possible, keep the discussion of the particular PR on the PR page