biocodellc / dipnet-fims

deprecated: dipnet front end for fims

Documentation update for Path forward for FASTQ submission via Biosample #9

Open jdeck88 opened 8 years ago

jdeck88 commented 8 years ago

See the wiki document for the biosample repository. It needs to be updated with more description, and we need comments back from the PIs so we have a path forward.

jdeck88 commented 8 years ago

Meeting on August 15, 2016: Options for storing DIPNET data file formats:

  1. Store the legacy data (Sanger data in FASTA format) on iDigBio. We don't anticipate large-scale, continued contributions of legacy data to the database, so the 1 TB of space should be adequate. Pros: No redundancy in the public database. Cons: This will probably involve a level of curation by the DIPnet coordinator that the other scenarios do not, but since we don't anticipate much more Sanger data coming in, it may be minimal.
  2. Convert FASTA files (legacy data) into FASTQ format with a quality score of 40 for each bp and upload to SRA. Pros: All the data will be in one place. Cons: There will be redundancy within the public database, since much of the legacy data is already in the NCBI nucleotide database.
  3. Create two systems to handle the two data types. Pros: If we could create a streamlined system, I can imagine uploading and downloading both Sanger and NG data through the same portal. Cons: This would involve designing two independent upload systems, which means more time and effort (precious resources).
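
For reference, the conversion described in option 2 could be sketched roughly as below. This is a minimal illustration, not an agreed implementation: the function name `fasta_to_fastq` is hypothetical, and it assumes the standard Phred+33 quality encoding, under which a score of 40 is written as the character `I`.

```python
def fasta_to_fastq(fasta_text, quality=40):
    """Convert FASTA text to FASTQ text, giving every base the same quality.

    Hypothetical helper for illustration only. Assumes Phred+33 encoding,
    so quality 40 becomes chr(40 + 33) == "I".
    """
    qual_char = chr(quality + 33)
    records = []
    header, seq_parts = None, []

    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            seq_parts.append(line)

    if header is not None:
        records.append((header, "".join(seq_parts)))

    # Each FASTQ record is four lines: @name, sequence, +, quality string.
    return "".join(
        f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n" for name, seq in records
    )


if __name__ == "__main__":
    print(fasta_to_fastq(">seq1\nACGT\nAC\n"), end="")
```

Note that the constant quality string carries no real information about the Sanger reads; it only satisfies the FASTQ format.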

There are a lot of nuances to be discussed here and we really want to get this right the first time! So I would like to schedule a Skype call this week or next to hash out ideas and settle on a solution. If you don't have time within the next week or two, then please register your opinion/concerns by email.


All: In favor of option #1. Folks are not needing to access Sanger data in Genbank.

All: Decision: Data must be FASTA or FASTQ.

Query: FASTA and FASTQ don’t have to be UNIONED.

Accommodating FASTQ:
- need size selections
- will need to add fields / options for FASTQ data; flesh this out in a couple of emails
- different projects for RAD and shotgun sequencing are possible; each one has different required fields

Interface/Indexing:
- ElasticSearch and the query interface go together and are the responsibility of Biocode

Uploading:
- if not fully automatic, then at least send back a SEQUIN file for manual upload. Look at this as a tool to make SEQUIN easy.

October 1 for first draft of FASTQ loader.

jdeck88 commented 8 years ago

And more information from Michelle on this:

As discussed at our last meeting we've come up with the additional fields for NG data. These fields would be completed when the fastq files are uploaded and would be tied to the sequence data and not part of the metadata per se.

One catch: there will be two types of data uploaded, single end (SE) and paired end (PE). The difference has to do with how the sequencer is run and determines whether one (SE) or two (PE) files are produced. The files are in the same format and use the same name for each sequence, with the R1 file containing reads from one end of the molecule and the R2 file containing reads from the opposite end. In the case of PE data, both files need to be uploaded.

We'll want to include a couple of notes for users to indicate that we only accept demultiplexed files, that specific protocols should be indicated under protocol citation or website (see below), and that PE files need to have exact matching file names (except R1 and R2 designations). I can work on this verbiage as the interface comes together.
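
The PE file-name rule above could be checked at upload time with something like the following. This is only a sketch under stated assumptions: the function name `paired_names_match` is hypothetical, and it assumes the `R1`/`R2` designation appears exactly once in each name and is set off by non-alphanumeric characters (for example `sample_R1.fastq`). Real naming conventions may need a looser pattern.

```python
import re

# Assumption: the read designation is "R1" or "R2", delimited by
# non-alphanumeric characters (or the start/end of the name).
_READ_TAG = re.compile(r"(?<![A-Za-z0-9])R([12])(?![A-Za-z0-9])")


def paired_names_match(r1_name, r2_name):
    """Return True if the two names are identical except for R1 vs R2."""
    tags1 = _READ_TAG.findall(r1_name)
    tags2 = _READ_TAG.findall(r2_name)
    # Require exactly one designation per name, R1 first and R2 second.
    if tags1 != ["1"] or tags2 != ["2"]:
        return False
    # Mask the designation and compare what is left.
    return _READ_TAG.sub("R#", r1_name) == _READ_TAG.sub("R#", r2_name)


if __name__ == "__main__":
    print(paired_names_match("sample_R1.fastq", "sample_R2.fastq"))
    print(paired_names_match("sampleA_R1.fastq", "sampleB_R2.fastq"))
```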

Extra fields would look something like this:

Fill in all that apply:

sequencer (under DEF we would list sequencing platforms, which would have to be updated occasionally, but would also allow them to type in their own)

Paired end or Single end (check box): If PE selected the upload would require two files

library prep protocol (under DEF we would have recommended types-this would have to be updated on occasion- but allow them to type in their own description) (DEF: Sanger, RADSeq, whole genome, amplicons, transcriptomes)

protocol citation or website (Please indicate reference for specific protocol and/or website)

restriction enzymes (comma separated, if applicable)

size selection or insert size (if applicable)
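
To make the field list concrete, here is one way the extra fields might be represented and checked when files are uploaded. It is a sketch, not the loader's actual design: the class name `FastqSubmission`, its attribute names, and the helper `parse_enzymes` are all hypothetical, and the only rule enforced is the one stated above (PE requires two files, SE one).

```python
from dataclasses import dataclass, field
from typing import List, Optional


def parse_enzymes(text):
    """Split a comma-separated enzyme entry, dropping empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


@dataclass
class FastqSubmission:
    """Hypothetical container for the extra NG fields listed above."""

    sequencer: str                  # picked from a list or typed in
    paired_end: bool                # True for PE, False for SE
    library_prep_protocol: str      # e.g. "RADSeq", or a free-text description
    files: List[str]                # uploaded FASTQ file names
    protocol_citation: Optional[str] = None
    restriction_enzymes: List[str] = field(default_factory=list)
    size_selection: Optional[str] = None  # size selection or insert size

    def problems(self):
        """Return a list of human-readable problems (empty if none)."""
        expected = 2 if self.paired_end else 1
        issues = []
        if len(self.files) != expected:
            issues.append(f"expected {expected} file(s), got {len(self.files)}")
        return issues


if __name__ == "__main__":
    submission = FastqSubmission(
        sequencer="example platform",  # placeholder value
        paired_end=True,
        library_prep_protocol="RADSeq",
        files=["sample_R1.fastq"],
        restriction_enzymes=parse_enzymes("EcoRI, MseI"),
    )
    print(submission.problems())
```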

Let me know if you have any questions.

Best, Michelle