SACGF / variantgrid

Claire Seurat / combined caller data VCFs #312

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

Claire uploaded some VCFs of a type we haven't seen before.

https://rollbar.com/jimmy.andrews/VariantGrid/items/2457/?utm_campaign=occurrence_message&utm_medium=slack&utm_source=rollbar-notification

VCF Error: Couldn't determine allele depth format field, source: 'combine_caller_data', formats: {'MAX', 'SH', 'SK', 'VC', 'SS', 'SR', 'CB'}

Another one failed due to:

https://rollbar.com/jimmy.andrews/VariantGrid/items/2458/?utm_campaign=occurrence_message&utm_medium=slack&utm_source=rollbar-notification

UploadPipeline 1493 failed. Filename: VCF Error: null value in column "variant_id" violates not-null constraint
DETAIL:  Failing row contains (740278211, U, {-1}, {-1}, null, 729, 0, 0, {-1}, {-1}, {-1}, 0, null).
CONTEXT:  COPY snpdb_cohortgenotype_collection_729, line 99: "729,,,0,0,0,U,{-1},{-1},{-1},{-1},{-1}"

Will need to hotfix vg.com to get it to upload

https://variantgrid.com/upload/view_upload_pipeline/1493 https://variantgrid.com/upload/view_upload_pipeline/1494

davmlaw commented 3 years ago

We can't really support the combined caller data until we have the ability to import arbitrary VCF fields.

##FORMAT=<ID=VC,Number=1,Type=Integer,Description="Somatic variant caller count">
##FORMAT=<ID=CB,Number=1,Type=String,Description="Variant was called by these callers">
##FORMAT=<ID=MAX,Number=1,Type=String,Description="Greatest somatic score from all callers">
##FORMAT=<ID=SR,Number=2,Type=String,Description="Somatic call by Seurat (judgement, QUAL)">
##FORMAT=<ID=SH,Number=2,Type=String,Description="Somatic call by Shimmer (judgement, QUAL)">
##FORMAT=<ID=SS,Number=2,Type=String,Description="Somatic call by SomaticSniper (judgement, SSC)">
##FORMAT=<ID=SK,Number=2,Type=String,Description="Somatic call by Strelka (judgement, QSS_NT)">
FORMAT  T1_v_N1
VC:CB:MAX:SR:SH:SS:SK   2:SR-SS:39.00:1,39.0:0,0:1,51:0,12.00

Might just have to import that as a no-genotype VCF.
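To make the problem concrete, here is a minimal sketch (not VariantGrid code) that pairs the FORMAT keys above with the `T1_v_N1` sample values. The function name `parse_sample` is hypothetical. It shows that the sample column only carries per-caller judgements and scores, with nothing resembling an allele depth or read depth field.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair each FORMAT key with its sample value.

    Values containing commas (the Number=2 fields here) are split into lists.
    The VCF spec lets trailing sample fields be omitted, so missing ones are
    filled with the missing-value marker ".".
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    if len(values) > len(keys):
        raise ValueError("Sample has more values than FORMAT keys")
    values += ["."] * (len(keys) - len(values))

    parsed = {}
    for key, value in zip(keys, values):
        parsed[key] = value.split(",") if "," in value else value
    return parsed


# FORMAT and sample strings copied from the record quoted above.
sample = parse_sample("VC:CB:MAX:SR:SH:SS:SK",
                      "2:SR-SS:39.00:1,39.0:0,0:1,51:0,12.00")
for key, value in sample.items():
    print(key, value)
```

Values are left as strings; converting them to numbers would need the `Type` declared in each `##FORMAT` header line.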

The other file (from Seurat) has a similarly unfamiliar set of fields:

##fileformat=VCFv4.1
##source=Seurat-2.6
##seuratarguments=snv_alpha=1;snv_beta=700;expected_insert_size=1500;min_event_quality=10.0;maximum_mismatches=3
##INFO=<ID=TYPE,Number=1,Type=String,Description="The type of somatic change detected">
##INFO=<ID=PILEUP1,Number=1,Type=String,Description="The pileup for the normal">
##INFO=<ID=PILEUP2,Number=1,Type=String,Description="The pileup for the tumor">
##INFO=<ID=AR1,Number=1,Type=Float,Description="Allele frequency of ALT allele in normal">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allele frequency of ALT allele in tumor">
##INFO=<ID=DP1,Number=1,Type=Integer,Description="The depth of coverage in normal">
##INFO=<ID=DP2,Number=1,Type=Integer,Description="The depth of coverage in tumor">
##INFO=<ID=SEQ,Number=1,Type=String,Description="The bases inserted">
##INFO=<ID=LN,Number=1,Type=Integer,Description="The length of a change">
##INFO=<ID=MVC1,Number=1,Type=String,Description="The median for the variant evidence distance from the end of the read (normal)">
##INFO=<ID=MVBQ1,Number=1,Type=String,Description="The median for the base quality of variant evidence in the normal sample">
##INFO=<ID=MVMQ1,Number=1,Type=String,Description="The median for the mapping quality of variant evidence in the normal sample">
##INFO=<ID=MVC2,Number=1,Type=String,Description="The median for the variant evidence distance from the end of the read (tumor)">
##INFO=<ID=MVBQ2,Number=1,Type=String,Description="The median for the base quality of variant evidence in the tumor sample">
##INFO=<ID=MVMQ2,Number=1,Type=String,Description="The median for the mapping quality of variant evidence in the tumor sample">

davmlaw commented 3 years ago

Emailed Claire saying we don't support these VCF formats yet - could potentially just import the positions but not all the INFOs - at least until we do #41

davmlaw commented 3 years ago

Claire says ok to just import them without any sample info

When we're trying to work out whether to use the Genotype or NoGenotype importer, we should always fall back to NoGenotype if we can't get the fields we're after.
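A rough sketch of that fallback rule is below. It is not the actual VariantGrid importer selection code: the function names, the returned labels, and the list of candidate allele depth keys are all assumptions made for illustration.

```python
from typing import Iterable, Optional

# Assumption: FORMAT keys that could supply allele depth. Illustrative only,
# not the list VariantGrid actually checks.
ALLELE_DEPTH_CANDIDATES = ("AD", "AO")


def find_allele_depth_field(format_ids: Iterable[str]) -> Optional[str]:
    """Return the first candidate allele depth key present in the header, else None."""
    available = set(format_ids)
    for candidate in ALLELE_DEPTH_CANDIDATES:
        if candidate in available:
            return candidate
    return None


def choose_importer(format_ids: Iterable[str], num_samples: int) -> str:
    """Pick "Genotype" only when the fields we need are present; otherwise fall back."""
    if num_samples == 0:
        return "NoGenotype"
    if find_allele_depth_field(format_ids) is None:
        # Fall back instead of raising "Couldn't determine allele depth format field"
        return "NoGenotype"
    return "Genotype"


# FORMAT IDs taken from the combined caller error message in this issue.
combined_formats = {"MAX", "SH", "SK", "VC", "SS", "SR", "CB"}
print(choose_importer(combined_formats, num_samples=1))
```

With this rule the combined caller VCF would import as positions only rather than failing the upload.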

davmlaw commented 3 years ago

I removed the sample column from the VCFs:

bcftools view SVC_RUNX1_singlepass_T1_v_N1_somatic.combined.snvs.HC.vcf.gz --drop-genotypes
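For reference, the effect of dropping genotypes on the columns can be approximated in plain Python. This is only a sketch of the idea, not a replacement for bcftools: it keeps the first 8 tab-separated columns (CHROM to INFO) and leaves the `##` header lines untouched. The function name `drop_genotypes` and the data record in the example are made up for illustration.

```python
from typing import Iterable, Iterator


def drop_genotypes(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines with the FORMAT and sample columns removed."""
    for line in lines:
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged
            yield line
        else:
            # The #CHROM line and data records: keep CHROM..INFO only
            columns = line.rstrip("\n").split("\t")
            yield "\t".join(columns[:8]) + "\n"


# Toy input: the record below is invented, not taken from Claire's files.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tT1_v_N1\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\tTYPE=example\tVC:CB\t2:SR-SS\n",
]
stripped = list(drop_genotypes(vcf_lines))
print("".join(stripped), end="")
```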

The combined caller VCF now works, but the Seurat one still fails (including on VG test):

null value in column "variant_id" violates not-null constraint
DETAIL:  Failing row contains (637895491, U, null, null, null, 765, 0, 0, null, null, null, 0, null, null, 0).
CONTEXT:  COPY snpdb_cohortgenotype_collection_765, line 99: "765,,,0,0,0,0,U,,,,,,"

Traceback (most recent call last):
  File "/mnt/variantgrid/upload/tasks/vcf/import_vcf_step_task.py", line 69, in run
    items_processed = self.process_items(upload_step)
  File "/mnt/variantgrid/upload/tasks/vcf/import_sql_copy_task.py", line 11, in process_items
    return sql_copy_files.cohort_genotype_sql_copy_csv(input_filename, table_name)
  File "/mnt/variantgrid/upload/vcf/sql_copy_files.py", line 76, in cohort_genotype_sql_copy_csv
    return sql_copy_csv(input_filename, table_name, COHORT_GENOTYPE_HEADER)
  File "/mnt/variantgrid/upload/vcf/sql_copy_files.py", line 33, in sql_copy_csv
    return sql_copy_csv_file(f, table_name, columns, delimiter, quote=quote)
  File "/mnt/variantgrid/upload/vcf/sql_copy_files.py", line 62, in sql_copy_csv_file
    raise e
  File "/mnt/variantgrid/upload/vcf/sql_copy_files.py", line 49, in sql_copy_csv_file
    value = cursor.copy_from(f,
psycopg2.errors.NotNullViolation: null value in column "variant_id" violates not-null constraint
DETAIL:  Failing row contains (637895491, U, null, null, null, 765, 0, 0, null, null, null, 0, null, null, 0).
CONTEXT:  COPY snpdb_cohortgenotype_collection_765, line 99: "765,,,0,0,0,0,U,,,,,,"
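One way to narrow this down before the `COPY` runs is to scan the intermediate CSV for rows with an empty variant ID. The sketch below is a hypothetical diagnostic helper, not part of VariantGrid. It assumes `variant_id` is the second CSV column (index 1), which is a guess from the failing row; the real column order is set by `COHORT_GENOTYPE_HEADER`, which is not shown here.

```python
import csv
import io
from typing import List, TextIO


def find_rows_missing_column(csv_file: TextIO, column_index: int,
                             delimiter: str = ",") -> List[int]:
    """Return 1-based row numbers where the given column is empty or absent.

    Row numbers match file line numbers as long as no field contains an
    embedded newline.
    """
    missing = []
    reader = csv.reader(csv_file, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if column_index >= len(row) or row[column_index] == "":
            missing.append(row_number)
    return missing


# Second row is the failing line from the CONTEXT message above; the first
# row is invented (with a made-up variant ID) to show a passing case.
example = io.StringIO(
    "765,12345,,0,0,0,0,U,,,,,,\n"
    "765,,,0,0,0,0,U,,,,,,\n"
)
missing_rows = find_rows_missing_column(example, column_index=1)
print(missing_rows)
```

Mapping the reported row numbers back to the VCF records that produced them should show which variants were never assigned an ID.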