Open averagehat opened 8 years ago
Looks good. Here are some suggestions:
I think Infection# was listed as Disease, so I changed its name to Infection#. I added type and subtype as optional fields. As for looking up the country via state/city: yes, that's possible. I can make the country field optional and add "city" and "state" columns that the country would be derived from (#4).
I noticed that at least one of the Influenza metadata files has a separate accession entry for each segment (see below). Is this the preferred way to store it, or should I also allow a single accession per record? It looks like I should maybe make two schemas, one for Dengue and one for Influenza, since they differ in many ways, but that is really up to you.
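To illustrate the optional-country idea, here is a rough sketch of how the fallback could work. The `State` column name, the `STATE_TO_COUNTRY` table, and the `resolve_country` helper are all hypothetical; a real lookup would need a proper gazetteer and would also have to handle cities.

```python
from typing import Dict, Optional

# Hypothetical lookup table for illustration only; a real one would
# come from a gazetteer or similar reference data.
STATE_TO_COUNTRY: Dict[str, str] = {
    "Alabama": "US",
    "Ontario": "Canada",
}


def resolve_country(row: Dict[str, str]) -> Optional[str]:
    """Return the explicit country if given, else try to derive it from the state."""
    country = (row.get("Country") or "").strip()
    if country:
        return country
    state = (row.get("State") or "").strip()
    # None means the country could not be derived, so validation should flag the row.
    return STATE_TO_COUNTRY.get(state)
```

A row with neither a country nor a known state returns `None`, which the validator could report as an error instead of guessing.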
SequenceName DatabaseName Sampling Year SamplingDate Country Continent Subtype Acc# HA SegmentHA Acc# MP SegmentMP Acc# NA SegmentNA Acc# NP SegmentNP Acc# NS SegmentNS Acc# PA SegmentPA Acc# PB1 SegmentPB1 Acc# PB2 SegmentPB2
>A/Alabama/01/2015 >A/Alabama/01/2015 2015 US N.America H3N2 EPI_ISL_173217 HA_4_567327 EPI_ISL_173217 MP_7_567322 EPI_ISL_173217 NA_6_567326 EPI_ISL_173217 NP_5_567320 EPI_ISL_173217 NS_8_567321 EPI_ISL_173217 PA_3_567323 EPI_ISL_173217 PB1_2_567325 EPI_ISL_173217 PB2_1_567324
Disease is a different field: for Dengue it could be DF, DHF1, DHF2, or DHF3, and for flu it could possibly be severe or mild. So the Infection# and Disease options are different.
Yes, influenza is submitted to GenBank by segment, so each segment has its own accession number. This is different in the GISAID database (EpiFlu): there, each virus has one unique number that is shared by all of its segments.
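One way to support both cases in a single schema might be to reshape each wide row into one record per segment. The sketch below assumes the column naming from the sample file above (`Acc# HA`, `SegmentHA`, and so on); the function names and output keys are made up for illustration.

```python
from typing import Dict, List

# Segment names taken from the sample file's column headers.
SEGMENTS = ["HA", "MP", "NA", "NP", "NS", "PA", "PB1", "PB2"]


def segment_records(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one wide metadata row into one record per segment that has an accession."""
    records = []
    for seg in SEGMENTS:
        accession = row.get(f"Acc# {seg}")
        if not accession:
            continue  # segment missing from this row
        records.append({
            "sequence_name": row["SequenceName"],
            "segment": seg,
            "accession": accession,
            "segment_id": row.get(f"Segment{seg}", ""),
        })
    return records


def has_single_accession(records: List[Dict[str, str]]) -> bool:
    """True when every segment shares one accession (GISAID style)."""
    return len({r["accession"] for r in records}) == 1


# Two segments from the sample row above.
row = {
    "SequenceName": ">A/Alabama/01/2015",
    "Acc# HA": "EPI_ISL_173217", "SegmentHA": "HA_4_567327",
    "Acc# MP": "EPI_ISL_173217", "SegmentMP": "MP_7_567322",
}
records = segment_records(row)
print(len(records), has_single_accession(records))  # prints: 2 True
```

A GenBank-style row would produce records with distinct accessions, so `has_single_accession` would return `False` for it.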
Got it, thanks for clarifying.
I propose the following schema for the input CSV files:
https://github.com/averagehat/pux-starter-app/blob/sequence-db/schema.md