NBISweden / beacon-api-tests

Compliance tester and test cases for Beacons.
GNU General Public License v3.0

Test data and cases for structural variants (DUP/DEL) #33

Open KyleGao opened 4 years ago

KyleGao commented 4 years ago

The current test data only includes breakpoint rearrangements; the DUP and DEL cases are not included. We would also like to have these test cases for our copy number beacon.

Copy number variants are imprecise DUP/DEL events spanning a large region (usually kilobases to megabases). A good example of DUP/DEL in VCF can be found on page 11 of the VCF specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf).
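
To make the request concrete, here is a minimal Python sketch of how an imprecise DUP/DEL record might be turned into a Beacon range query. The two records are illustrative, written in the style of the spec's structural-variant examples and not taken from the test data. The function names `parse_info` and `bracket_query` are hypothetical, and the coordinate conversion (1-based VCF POS to a 0-based start, END used unchanged) is an assumption that should be checked against the Beacon spec.

```python
# Illustrative VCF data lines with symbolic ALT alleles (not from the test data).
RECORDS = [
    "2\t321682\t.\tT\t<DEL>\t6\tPASS\t"
    "IMPRECISE;SVTYPE=DEL;END=321887;CIPOS=-56,20;CIEND=-10,62",
    "3\t12665100\t.\tA\t<DUP>\t14\tPASS\t"
    "IMPRECISE;SVTYPE=DUP;END=12686200;CIPOS=-500,500;CIEND=-500,500",
]


def parse_info(info):
    """Split a VCF INFO column into a dict; flag entries map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def bracket_query(line):
    """Derive hypothetical Beacon range parameters from one SV record.

    Assumption: Beacon positions are 0-based, so POS - 1 is used as the
    start and END is used unchanged as the end.
    """
    chrom, pos, _id, _ref, alt, _qual, _filter, info = line.split("\t")[:8]
    fields = parse_info(info)
    # Confidence intervals are offsets around POS and END; default to precise.
    ci_pos = [int(x) for x in fields.get("CIPOS", "0,0").split(",")]
    ci_end = [int(x) for x in fields.get("CIEND", "0,0").split(",")]
    start, end = int(pos) - 1, int(fields["END"])
    return {
        "referenceName": chrom,
        "variantType": alt.strip("<>"),
        "startMin": start + ci_pos[0],
        "startMax": start + ci_pos[1],
        "endMin": end + ci_end[0],
        "endMax": end + ci_end[1],
    }


for record in RECORDS:
    print(bracket_query(record))
```

A test case for a copy number beacon could then assert that a query built this way returns a hit for the dataset containing the record.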

MalinAhlberg commented 4 years ago

Thanks for the feedback and the link! There are some tests for DEL (see https://github.com/NBISweden/beacon-api-tests/blob/b0406a023369a97f7180f3585015187e0296b92d/tests/v101/test_counts.py#L198 and below), but you are right that there are none for DUP. We will try to include them, hopefully in the near future!

mbaudis commented 4 years ago

O.k.; here are some more issues/comments (the Beacon+ ones are "notes to self"...).

@MalinAhlberg @KyleGao @sdelatorrep

Some notes about the specification tests

INFO: Testing version v101
INFO: *** Running tests from test_datasets
INFO: Testing test_two_datasets
    Test that both datasets repsond.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=TG
  assemblyId=GRCh38
  start=16577043
  end=16577045
  includeDatasetResponses=HIT
  variantType=SNP

There is a case to be made for supporting wildcard scenarios, e.g. by allowing a "SNP" query against a position or range, w/o any specification of referenceBases or alternateBases.

INFO: Testing no_refbases
    Check that queries without referenceBases is not allowed.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  alternateBases=N
  assemblyId=GRCh38
  start=0
  end=2
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01

This is correct, but the minimum use of a single "N" for structural or wildcard queries as per spec is ambiguous: a query with "referenceBases=N" can be interpreted as requiring "any referenceBases value of length 1", and would then not match e.g. "referenceBases=CG".
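
The two readings of a lone "N" can be sketched in Python as follows. Both function names are hypothetical and neither is claimed to be what the spec intends; the point is only that they disagree on a two-base reference such as "CG".

```python
import re


def match_single_base_wildcard(pattern, bases):
    """Reading A: each N stands for exactly one base, so lengths must match."""
    regex = "".join("[ACGT]" if ch == "N" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, bases) is not None


def match_any_length_wildcard(pattern, bases):
    """Reading B: a lone N stands for any non-empty run of bases."""
    if pattern == "N":
        return re.fullmatch("[ACGT]+", bases) is not None
    return pattern == bases


for ref in ("C", "CG"):
    print(
        ref,
        match_single_base_wildcard("N", ref),
        match_any_length_wildcard("N", ref),
    )
```

Under reading A the stored value "CG" is not matched by "N"; under reading B it is.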

INFO: Testing test_snp
    Test variantType SNP.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=C
  assemblyId=GRCh38
  start=17302971
  end=17302972
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01
  variantType=SNP
INFO: Testing test_bad_end
    Test querying with a bad end position.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=A
  alternateBases=G
  assemblyId=GRCh38
  start=17300407
  end=17300409
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01

and

INFO: Testing test_end
    Test the same query as `test_bad_end()` but with the correct end position.
...
INFO: Testing test_insertion
    Test variantTypes INS.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=A
  assemblyId=GRCh38
  start=16064512
  end=16064513
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01
  variantType=INS

This may be a correct use, but it is not really documented in the spec. It would be considered a wildcard query at a precise position, not a structural one.

INFO: Testing test_deletion
    Test variantTypes DEL.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=GACAA
  assemblyId=GRCh38
  startMin=16517679
  startMax=16517680
  endMin=16517684
  endMax=16517684
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01
  variantType=DEL

and

INFO: Testing test_deletion_2
    Test variantTypes DEL with startMin/startMax.
...
INFO: Testing test_snp_mnp
    Test representation of TG->AG and multiple variations from one vcf line.
INFO: Open https://beacon.progenetix.org/query?
  referenceName=22
  referenceBases=TG
  assemblyId=GRCh38
  start=16577043
  end=16577045
  includeDatasetResponses=HIT
  datasetIds=GRCh38%3Abeacon_test%3A2030-01-01
  variantType=SNP

As above, "SNP" use for wildcard searches? This is not documented (i.e. no required use of explicit variant type "SNP").

(streamlined/clarified some comments in edit 2019-10-29)

viklund commented 4 years ago

Thanks for these comments, @mbaudis!

A lot of the comments are about how to interpret the variantType field in the spec. Of relevance here is that in the vcf file we use there currently aren't any symbolic alternate alleles.

We still think that the variantType is obvious in many cases, so we have tested that the beacon can respond to them; we think researchers would be surprised if they got no response otherwise. But maybe translating a more free-form query into a beacon-api query is the job of a frontend tool.

We do find it a little confusing that the API uses two different fields (alternateBases and variantType) that map to the same field in the VCF file (ALT), in such a way that only one of them may be present. This is especially so since the VCF specification itself mentions different variant types in section 5.2 ("Decoding VCF entries for SNPs and small indels"). But if this is what the specification means, the tester should comply with it.
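
One way to read the "only one of the two fields" rule is as a split on whether the VCF ALT value is symbolic. The Python sketch below illustrates that reading only; `alt_to_beacon_params` is a hypothetical helper, and reducing a subtype such as `<DUP:TANDEM>` to its top-level type is an assumption, not something the thread or the spec states.

```python
import re

# A symbolic allele is an identifier wrapped in angle brackets, e.g. <DEL>.
SYMBOLIC_ALT = re.compile(r"<([^<>]+)>")
# A sequence allele is a plain run of bases.
SEQUENCE_ALT = re.compile(r"[ACGTNacgtn]+")


def alt_to_beacon_params(alt):
    """Map one VCF ALT value to exactly one Beacon query field."""
    symbolic = SYMBOLIC_ALT.fullmatch(alt)
    if symbolic:
        # Assumption: keep only the top-level type of e.g. DUP:TANDEM.
        return {"variantType": symbolic.group(1).split(":")[0]}
    if SEQUENCE_ALT.fullmatch(alt):
        return {"alternateBases": alt.upper()}
    # Breakends, '*' and '.' are deliberately left out of this sketch.
    raise ValueError(f"unsupported ALT value: {alt!r}")


for alt in ("G", "<DEL>", "<DUP:TANDEM>"):
    print(alt, alt_to_beacon_params(alt))
```

With a VCF file that contains no symbolic alternate alleles, this mapping would never produce a variantType, which is the situation described above for the current test data.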

In your first example, do you mean that the beacon should return a 400 bad request response?

As for the usage of the end parameter: we did not interpret the specification as disallowing end when both start and referenceBases are used. But maybe this should also return a 400 bad request? Or should the beacon just ignore the end parameter?
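
The two options can be written down as a small Python sketch. It assumes that a consistent end equals start + len(referenceBases), which is what the logged test_snp query appears to use (start=17302971, referenceBases=C, end=17302972). The function `resolve_end` and its `policy` argument are hypothetical and do not come from the spec or the tester.

```python
def resolve_end(start, reference_bases, end=None, policy="reject"):
    """Return (http_status, effective_end) for a precise query.

    Assumption: the end implied by the query is start + len(reference_bases).
    policy="reject" answers 400 on a mismatch; policy="ignore" drops the
    supplied end and uses the implied one.
    """
    implied_end = start + len(reference_bases)
    if end is None or end == implied_end:
        return 200, implied_end
    if policy == "reject":
        return 400, None
    if policy == "ignore":
        return 200, implied_end
    raise ValueError(f"unknown policy: {policy!r}")


# Values taken from the test_snp and test_bad_end log entries.
print(resolve_end(17302971, "C", 17302972))
print(resolve_end(17300407, "A", 17300409, policy="reject"))
print(resolve_end(17300407, "A", 17300409, policy="ignore"))
```

Whichever policy the specification settles on, the tester would need to expect the matching status for test_bad_end.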

And just to make sure that we are on the same page regarding terminology: when you say "structural query", do you mean those cases that use a symbolic ALT in the VCF?