NCI-DCEG / Flow-IQ

A workflow cloud migration toolkit

Curate test dataset on AWS #1

Open shukwong opened 1 month ago

shukwong commented 1 month ago

Curate test datasets on AWS, preferably from S3 open data. These may include germline, tumor/normal, genotype, and RNA-Seq data.

shukwong commented 2 weeks ago

Some suggestions for germline variant calling and GWAS genotype sample data:

GIAB Ashkenazim Trio BAMs:
HG002 (son): https://42basepairs.com/browse/s3/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams
HG003 (father): https://42basepairs.com/browse/s3/giab/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/novoalign_bams
HG004 (mother): https://42basepairs.com/browse/s3/giab/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/novoalign_bams
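For scripting against these BAMs, the browse links can be turned into S3 prefixes. The sketch below assumes that a 42basepairs browse URL of the form `https://42basepairs.com/browse/s3/<bucket>/<key>` corresponds to `s3://<bucket>/<key>`; that mapping is inferred from the URL shape, not from any documented API, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: browse URLs embed the bucket and key after this path prefix.
BROWSE_PREFIX = "/browse/s3/"


def browse_url_to_s3_uri(url: str) -> str:
    """Convert a 42basepairs browse URL into an s3:// prefix (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.netloc != "42basepairs.com" or not parsed.path.startswith(BROWSE_PREFIX):
        raise ValueError(f"not a 42basepairs S3 browse URL: {url}")
    # Everything after the prefix is "<bucket>/<key>"; normalise the trailing slash.
    bucket_and_key = parsed.path[len(BROWSE_PREFIX):].strip("/")
    return f"s3://{bucket_and_key}/"


if __name__ == "__main__":
    print(browse_url_to_s3_uri(
        "https://42basepairs.com/browse/s3/giab/data/AshkenazimTrio/"
        "HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams"
    ))
```

If the bucket allows anonymous access, the resulting prefix could then be listed with `aws s3 ls --no-sign-request <prefix>`.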

Truth set for HG002 (son) small variants: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/chrXY_v1.0/GRCh38/SmallVariant/
Others at: https://www.nist.gov/programs-projects/genome-bottle

GWAS genotype data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/COXHAP&version=4.1 (see https://github.com/andrewhaoyu/multi_ethnic for more details)

jaamarks commented 1 week ago

WIP

Tools and types

Here is the list of tools currently included in our flowiq_mapping.json file, along with the data types that each tool can process.

File types

  1. BED (Browser Extensible Data)
  2. BED/BIM/FAM (Plink format)
  3. BCL (binary base call sequence file format)
  4. CRAM (Compressed Reference-oriented Alignment Map)
  5. FASTQ
  6. FASTA, FAI
  7. GFF/GTF (General Feature Format/Gene Transfer Format)
  8. GVCF (Genomic VCF)
  9. MAP/PED (Plink format)
  10. SAM/BAM (Sequence/Binary Alignment Map)
  11. SRA (Sequence Read Archive Normalized/Lite Format)
  12. VCF/BCF (Variant/Binary Call Format)
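When curating candidate files, it may help to tag each one with a type from the list above. The following is a minimal sketch that guesses the type from the file extension. The extension table and the `classify` function are illustrative assumptions; they do not reflect the actual schema of flowiq_mapping.json.

```python
from typing import Optional

# Hypothetical extension table; labels follow the file type list above.
EXTENSION_TO_TYPE = {
    ".bed": "BED (UCSC BED or Plink binary; ambiguous by extension alone)",
    ".bim": "BED/BIM/FAM", ".fam": "BED/BIM/FAM",
    ".bcl": "BCL", ".cbcl": "BCL",
    ".cram": "CRAM",
    ".fastq": "FASTQ", ".fq": "FASTQ",
    ".fasta": "FASTA", ".fa": "FASTA", ".fai": "FAI",
    ".gff": "GFF/GTF", ".gff3": "GFF/GTF", ".gtf": "GFF/GTF",
    ".g.vcf": "GVCF", ".gvcf": "GVCF",
    ".map": "MAP/PED", ".ped": "MAP/PED",
    ".sam": "SAM/BAM", ".bam": "SAM/BAM",
    ".sra": "SRA",
    ".vcf": "VCF/BCF", ".bcf": "VCF/BCF",
}

COMPRESSION_SUFFIXES = (".gz", ".bgz")


def classify(filename: str) -> Optional[str]:
    """Guess a file type label from the extension, or return None if unknown."""
    name = filename.lower()
    # Drop one compression suffix so "x.vcf.gz" is treated like "x.vcf".
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Try longer extensions first so ".g.vcf" wins over ".vcf".
    for ext in sorted(EXTENSION_TO_TYPE, key=len, reverse=True):
        if name.endswith(ext):
            return EXTENSION_TO_TYPE[ext]
    return None


if __name__ == "__main__":
    for example in ("HG002.bam", "sample.g.vcf.gz", "reads_R1.fastq.gz", "notes.txt"):
        print(example, "->", classify(example))
```

Extension matching is only a first pass: `.bed` is shared by two formats in the list, so a real check would need to inspect file contents or sibling files.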

jaamarks commented 1 week ago

May be useful: https://registry.opendata.aws/

shukwong commented 1 week ago

Also this: https://github.com/nf-core/test-datasets. These are small test datasets; for some tools, whole-genome data may be more useful.