GMOD / sars-cov-2-jbrowse

Repo for storing dockerfile and config for a coronavirus genome browser
MIT License

Provide raw data and scripts #2

Open jszinger opened 4 years ago

jszinger commented 4 years ago

Please provide links to the raw data and the scripts needed to format them for JBrowse. I would like to set up my own instance using data of known provenance and a proven chain of custody. Pulling preformatted data from the cloud does not meet this requirement.

scottcain commented 4 years ago

Hi @jszinger ,

I'm guessing you're referring to the SRA tracks. The raw data for each track can be obtained by following the link in the "about this track" dialog box, which you open by mousing over the track label and clicking the menu's down triangle. The "about" box also links to the analysis performed by the Galaxy people, from whom I got the VCF files. That URL points to the top level of the analysis repo, but you can get more information about the variant analysis by digging a few directories down to the variant readme: https://github.com/galaxyproject/SARS-CoV-2/blob/master/genomics/4-Variation/README.md

The only things I did to the VCF files after getting them from the Galaxy folks were to change the name of the reference sequence (Galaxy used "NC_045512"; in JBrowse I used "NC045512.2") and to filter out variants with a frequency of less than 1%, which I did with a simple Perl one-liner: `perl -ni.bak -e 'if ($_=~/^NC045512.2/ and $_=~/AF=0.00/) {next;} else {print;}' *.vcf`. I then bgzipped and tabix-indexed the files so JBrowse could read them.
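For anyone who wants to script those VCF steps, here is a rough sketch. It is not the exact procedure used for this repo: the function names are made up for illustration, the `awk` filter stands in for the Perl one-liner, and the reference names are passed as arguments, so the values shown are only the ones quoted in this thread. `bgzip` and `tabix` are the standard htslib tools and must be installed for `prepare_vcf` to work.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of JBrowse or htslib.

# rename_reference OLD NEW: rewrite the first (CHROM) column of data lines.
# Header lines (starting with "#") pass through unchanged.
rename_reference() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { FS = OFS = "\t" } $1 == old { $1 = new } { print }'
}

# drop_low_af REF: drop lines that start with REF and contain "AF=0.00",
# i.e. variants with an allele frequency below 1%; keep everything else.
drop_low_af() {
  awk -v ref="$1" \
    'index($0, ref) == 1 && index($0, "AF=0.00") > 0 { next } { print }'
}

# prepare_vcf FILE.vcf: rename, filter, compress, and index one file.
prepare_vcf() {
  in="$1"
  out="${in%.vcf}.filtered.vcf"
  rename_reference "NC_045512" "NC045512.2" < "$in" \
    | drop_low_af "NC045512.2" > "$out"
  bgzip -f "$out"          # writes "$out.gz"
  tabix -p vcf "$out.gz"   # writes "$out.gz.tbi"
}
```

After sourcing the file, `prepare_vcf sample.vcf` (a hypothetical file name) would leave `sample.filtered.vcf.gz` and its `.tbi` index next to the input.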

Is that what you're looking for? Scott

jszinger commented 4 years ago

I'm actually asking about the other tracks: CDS, Genes, primers and multi alignment. For example, there's a bunch of processing that needs to happen to https://www.ncbi.nlm.nih.gov/nuccore/NC_045512 before it can be displayed by JBrowse, and I wish to know the details of retrieval and processing.

Thanks, Jim

scottcain commented 4 years ago

Ah, OK. The data processing for those is relatively straightforward. First, get the FASTA and GFF3 files for NC_045512 from the page you linked to: click the "send to" link, select complete record, choose file as the destination, and then select FASTA or GFF3 from the format drop-down menu (once for each format).

Once you have the files, first run `bin/prepare-refseqs.pl --fasta <name of fasta file>` in the jbrowse directory (which you get either by downloading a release from jbrowse.org or by running `git clone https://github.com/GMOD/jbrowse.git` and following the build instructions, which basically boil down to running `./setup.sh`). That creates the "reference sequence" track in JBrowse.

Then you can process data from the GFF3 file to get tracks for genes and CDS. The command generally looks like `bin/flatfile-to-json.pl --gff <filename> --type <gene or CDS> --trackType CanvasFeatures --key <genes or CDS> --trackLabel <genes or CDS>`. This command generates a set of JSON files that JBrowse uses to display the gene and CDS tracks. Changes I made to the default display (like stealing the color scheme for the CDS features from NextStrain) are encoded in the trackList.json file. (The trackList.json file is created when you run the prepare-refseqs script and is added to when you run flatfile-to-json.)
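Put together, the reference sequence, gene, and CDS steps might look like the sketch below. The `run` and `build_tracks` helpers and the `DRY_RUN` switch are hypothetical conveniences, not part of JBrowse, and the track keys and labels are placeholders. The label flag is written as `--trackLabel`, which is the option name I know from the JBrowse 1 `flatfile-to-json.pl` documentation; treat that as an assumption and check `bin/flatfile-to-json.pl --help` in your checkout.

```shell
#!/bin/sh
# Sketch only. Run from the JBrowse 1 directory after ./setup.sh has finished.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# build_tracks FASTA GFF3: create the reference sequence track,
# then one CanvasFeatures track per feature type.
build_tracks() {
  fasta="$1"
  gff="$2"
  run bin/prepare-refseqs.pl --fasta "$fasta"
  for type in gene CDS; do
    run bin/flatfile-to-json.pl --gff "$gff" --type "$type" \
      --trackType CanvasFeatures --key "$type" --trackLabel "$type"
  done
}
```

Running `DRY_RUN=1; build_tracks NC_045512.fasta NC_045512.gff3` (file names are examples) prints the three commands so they can be reviewed before anything is written to the data directory.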

The primers tracks came from me "scraping" the primer sequences from the linked resources and using "Add sequence search track" for each primer sequence to identify its coordinates. I then wrote a GFF3 file by hand and processed it with flatfile-to-json, similar to the above. The primers.gff file in this repo is the result of those searches.
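If you would rather generate such a file than type it by hand, a small helper like the one below can emit one GFF3 line per primer. Everything specific here is an assumption for illustration: the function name, the `manual` source column, the `primer_binding_site` feature type, and the example coordinates are not taken from primers.gff in this repo.

```shell
#!/bin/sh
# Sketch only: gff3_line is a hypothetical helper, not part of JBrowse.

# gff3_line SEQID NAME START END STRAND
# Prints one tab-separated GFF3 feature line with the nine standard columns:
# seqid, source, type, start, end, score, strand, phase, attributes.
gff3_line() {
  printf '%s\t%s\t%s\t%s\t%s\t.\t%s\t.\tName=%s\n' \
    "$1" "manual" "primer_binding_site" "$3" "$4" "$5" "$2"
}
```

A file could then be assembled with `echo '##gff-version 3'` followed by one `gff3_line` call per primer, for example `gff3_line NC045512.2 example_F 100 120 +`, using the coordinates found with the sequence search track.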

The multialignment track I know a little bit less about: the BED file I used was created by @cmdcolin and I just grabbed the data. I know that it was fairly straightforward: all SARS-CoV-2 sequences were obtained from GenBank and downloaded as a multialignment FASTA file, which was then processed into a BED file and tabix indexed. Yes, that feels a little hand-wavy; perhaps @cmdcolin can fill in a little bit of detail if you like. I added the track configuration for this track to the trackList.json file by hand.
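The FASTA-to-BED conversion itself is not described here, so the sketch below covers only the final step, sorting and tabix-indexing an existing BED file. The function names are hypothetical; `bgzip` and `tabix -p bed` are the standard htslib commands, and tabix expects its input sorted by sequence name and start position.

```shell
#!/bin/sh
# Sketch only: covers indexing, not how the BED file was produced.

# sort_bed: sort BED records by sequence name, then numerically by start.
sort_bed() {
  LC_ALL=C sort -k1,1 -k2,2n
}

# index_bed FILE.bed: write FILE.sorted.bed.gz and its .tbi index.
index_bed() {
  sorted="${1%.bed}.sorted.bed"
  sort_bed < "$1" > "$sorted"
  bgzip -f "$sorted"            # writes "$sorted.gz"
  tabix -p bed "$sorted.gz"     # writes "$sorted.gz.tbi"
}
```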

This is a fairly brief overview but should do the job of letting you know how the data were processed. If you want to do something similar, please feel free to email the JBrowse mailing list at gmod-ajax@lists.sourceforge.net or hit us up in Gitter: https://gitter.im/GMOD/jbrowse

scottcain commented 4 years ago

@jszinger ,

If you feel the above descriptions are adequate, let me know and I can add them to the "about this track" box for each track where it makes sense.