BaderLab / GenomeClinic-PGX

Web application for clinical pharmacogenomic interpretation
GNU Lesser General Public License v3.0

VCF Uploader only works for some Sequencing Platforms #97

Closed patmagee closed 9 years ago

patmagee commented 9 years ago

Issues we have come across with VCF uploads that still need to be addressed:

Processing non-standard VCF files

  1. Should the pipeline try to coerce the incoming VCF file into a standard format?
  2. How do we handle fields that are not standardized?
  3. If there are non-standard fields, how can we differentiate between these fields and potential additional patients within the file? For example, multiple patients per VCF file, followed by an annotation field at the end.
  4. Should we encourage a "1 patient per file" policy, where we only try to interpret a single patient? There could be an optional "add" field where the user types in the EXACT name of another patient as it appears in the VCF file. They would get a check mark if the name was found in the header.
  5. What other non-standard VCF files are there? Are these standard on an institutional basis, so that we can tweak the uploader to reflect each institution?
  6. Are there any scripts that were run previously to normalize these formats?

    Variants required by PGX are not in standard VCF format

  7. The variants that PGX uses must be in a different format than the standard VCF format. By default, when there is an insertion or deletion, the leading base is kept. In its current state the PGX app requires this to be changed to reflect the dbSNP references (A / AC would become - / C). This is not indicative of the original positions called by the variant caller, and may introduce errors into the data. We should store the original values and the original position for these variants, then add fields for the modified ref/alt and position. The rsIDs correspond to the modified values.
  8. Currently, we are modifying this field ourselves. I am not confident in this technique, and it needs more unit tests.
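
A minimal sketch of what point 7 proposes: keep the caller's original values and store the modified ones alongside them. The names `VariantRecord` and `to_dbsnp_style` are hypothetical and not taken from the app. Shifting the position by the number of trimmed bases is an assumption; conventions for insertions differ between tools and would need to be checked against whatever the rsID lookup expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    # Values exactly as emitted by the variant caller.
    orig_pos: int
    orig_ref: str
    orig_alt: str
    # Values after trimming the shared leading bases.
    mod_pos: int
    mod_ref: str
    mod_alt: str


def to_dbsnp_style(pos: int, ref: str, alt: str) -> VariantRecord:
    """Trim leading bases shared by ref and alt, keeping the originals."""
    trimmed = 0
    # Only indels (alleles of different length) are rewritten;
    # same-length substitutions pass through unchanged.
    if len(ref) != len(alt):
        shortest = min(len(ref), len(alt))
        while trimmed < shortest and ref[trimmed] == alt[trimmed]:
            trimmed += 1
    # An allele that becomes empty is written as "-".
    mod_ref = ref[trimmed:] or "-"
    mod_alt = alt[trimmed:] or "-"
    # Assumption: the position moves right by the number of trimmed bases.
    return VariantRecord(pos, ref, alt, pos + trimmed, mod_ref, mod_alt)
```

With this, `to_dbsnp_style(100, "A", "AC")` keeps `A` / `AC` at position 100 as the original and stores `-` / `C` as the modified alleles, which gives the unit tests asked for in point 8 a small, pure function to target.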

    Extra Variant Annotation

  9. At the moment, we are essentially throwing out any non-ANNOVAR annotation, which can have the side effect of losing a lot of information. Should this data be kept?

    Processing and storing versions of the build.

  10. We are in a transition phase. People are starting to use GRCh38/hg38 over GRCh37/hg19. How will we address this? The PGX system will have to be adapted to reflect the changes in build version and in the positions on the reference genome.
  11. Should we support multiple reference builds and have the user indicate which build to use?
  12. We cannot reliably assume that reference build information will be included in the VCF file, so we have no dependable way to determine which build a file uses.
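
One possible way to combine points 11 and 12, sketched under assumptions: look for a build label in the optional `##reference`, `##contig` and `##assembly` header lines, and fall back to a build chosen by the user when nothing conclusive is found. The function names and the `BUILD_TOKENS` table are hypothetical, and the substring match is only a heuristic.

```python
from typing import Iterable, Optional

# Hypothetical token table; extend it with labels seen in real files.
BUILD_TOKENS = {
    "GRCh37": ("grch37", "hg19", "b37"),
    "GRCh38": ("grch38", "hg38"),
}


def guess_build(header_lines: Iterable[str]) -> Optional[str]:
    """Return a build name if the meta lines name exactly one build."""
    found = set()
    for line in header_lines:
        if not line.startswith("##"):
            break  # meta-information lines are over
        lowered = line.lower()
        if not lowered.startswith(("##reference", "##contig", "##assembly")):
            continue
        for build, tokens in BUILD_TOKENS.items():
            if any(token in lowered for token in tokens):
                found.add(build)
    # No hint, or conflicting hints, means the header cannot be trusted.
    return found.pop() if len(found) == 1 else None


def resolve_build(header_lines: Iterable[str],
                  user_choice: Optional[str] = None) -> Optional[str]:
    """Prefer a build found in the header, else the user's selection."""
    return guess_build(header_lines) or user_choice
```

Returning `None` rather than a default keeps the "we could not tell" case explicit, so the uploader can ask the user instead of silently assuming a build.
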
patmagee commented 9 years ago

It also appears that when no sample columns follow the final FORMAT field, the uploader interprets every field in the header line as a unique entry.
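
A sketch of how the column header line could be read so that this case yields an empty sample list. It assumes a tab-delimited `#CHROM` line with the eight fixed columns in the standard order; `sample_names` and `has_sample` are hypothetical names, not functions from the uploader.

```python
from typing import List

FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line: str) -> List[str]:
    """Return the sample columns of a '#CHROM ...' line, possibly empty."""
    fields = header_line.rstrip("\r\n").split("\t")
    if fields[:8] != FIXED_COLUMNS:
        raise ValueError("not a standard VCF column header line")
    # A file without genotype data ends after INFO.
    if len(fields) == 8:
        return []
    if fields[8] != "FORMAT":
        raise ValueError("expected FORMAT as the ninth column")
    # Everything after FORMAT is a sample name; this may be an empty list.
    return fields[9:]


def has_sample(header_line: str, name: str) -> bool:
    """Exact, case-sensitive lookup of a patient name in the header."""
    return name in sample_names(header_line)
```

The same `has_sample` lookup could back the check mark for an exactly typed patient name, and rejecting an unexpected ninth column makes non-standard files fail loudly instead of being read as extra patients.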

patmagee commented 9 years ago

There should be a check, both on the client side when uploading and on the server side, to determine the input type, i.e. which VCF format specification is being used. If the file follows the standard spec, this should be on the first line.
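
A minimal sketch of the server-side half of that check, assuming the first line has the standard form `##fileformat=VCFvX.Y`. The function name `vcf_version` is hypothetical; the client could run an equivalent test on the first line before uploading.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "##fileformat=VCFv4.2", allowing trailing whitespace.
_FILEFORMAT = re.compile(r"^##fileformat=VCFv(\d+)\.(\d+)\s*$")


def vcf_version(first_line: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the first line, or None if it is absent."""
    match = _FILEFORMAT.match(first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A `None` result marks the file as not following the standard spec, which is the point at which the uploader could reject it or route it to format-specific handling.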

patmagee commented 9 years ago

I have fixed the current major issues with the uploader and have moved the remaining issues to a new issue, #105.