Clinical-Genomics / preClinVar

A ClinVar API submission helper written in FastAPI
MIT License

Submission status is "error"... #82

Closed dnil closed 1 year ago

dnil commented 1 year ago

Oopsie with the submission from yesterday: Screenshot 2023-02-01 at 08 03 37

Any chance of seeing that error, or is it lost in an HTTP response somewhere?

This could of course be a Scout error as well, or both, since it was flagged as submitted OK in there, but let's start somewhere. 😊

northwestwitch commented 1 year ago

I wasn't watching the repo, sorry. Now I'm aware of everything that happens around here.

I was watching the logs from the preClinVar container and this is all 🤔

image

Does it mean that it returns a file when the submission fails? We should return that to the user then!

northwestwitch commented 1 year ago

Perhaps we can check that error by monitoring the submission status, either directly on the ClinVar API or by adding an endpoint here or in Scout.

The easiest for now would be their API:

Check it here: https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/

And then go to Submission status

Unfortunately I can't do that because the key is personal. Let me know what it says!

dnil commented 1 year ago

Running

curl -s -D - --header "Content-type: application/json" --header "SP-API-KEY: veryverysecret" "https://submit.ncbi.nlm.nih.gov/api/v1/submissions/SUB12692623/actions/"

gives the reply

HTTP/2 200
strict-transport-security: max-age=31536000; includeSubDomains; preload
content-type: application/json; charset=UTF-8
content-length: 486
set-cookie: ncbi_sid=C806E90960DBD678%5F2CF2SID; Max-Age=31536000; Domain=.nih.gov; Path=/; Secure; $x-enc=URI_ENCODING
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block
date: Thu, 02 Feb 2023 07:34:35 GMT
server: Apache

{"actions":[{"id":"SUB12692623-1","targetDb":"clinvar","status":"error","updated":"2023-01-31T15:42:04.015724Z","responses":[{"status":"error","message":{"severity":"error","errorCode":"2","text":"Your ClinVar submission processing status is \"Error\". Please find the details in the file referenced by actions[0].responses[0].files[0].url."},"files":[{"url":"https://submit.ncbi.nlm.nih.gov/api/2.0/files/ehafpkdp/sub12692623-summary-report.json/?format=attachment"}],"objects":[]}]}]}
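For a status monitor, the reply above could be unpacked programmatically rather than read by eye. This is only a minimal sketch: `summarize_actions` is a hypothetical helper (not part of preClinVar or Scout), and it assumes the reply always has the `actions` / `responses` / `files` layout seen here. The example string is a trimmed copy of the reply.

```python
import json


def summarize_actions(reply_text):
    """Return (action id, status, report urls) for each action in the reply.

    Hypothetical helper; assumes the JSON layout shown in the reply above.
    """
    reply = json.loads(reply_text)
    summary = []
    for action in reply.get("actions", []):
        # Collect every file url attached to any response of this action
        urls = [
            file_info["url"]
            for response in action.get("responses", [])
            for file_info in response.get("files", [])
        ]
        summary.append((action["id"], action["status"], urls))
    return summary


# Trimmed copy of the reply above (message text and timestamps left out)
example_reply = (
    '{"actions":[{"id":"SUB12692623-1","targetDb":"clinvar","status":"error",'
    '"responses":[{"status":"error","files":[{"url":'
    '"https://submit.ncbi.nlm.nih.gov/api/2.0/files/ehafpkdp/'
    'sub12692623-summary-report.json/?format=attachment"}],"objects":[]}]}]}'
)

for action_id, status, urls in summarize_actions(example_reply):
    print(action_id, status, urls)
```

The urls returned this way are the ones to fetch (with the same API key header) to get the detailed report.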

The file that the url in the reply points to is this:

{
    "submissionName": "SUB12692621",
    "submissionDate": "2023-01-31",
    "batchProcessingStatus": "Error",
    "batchReleaseStatus": "Not released",
    "totalCount": 2,
    "totalErrors": 2,
    "totalSuccess": 0,
    "totalPublic": 0,
    "submissions": [
        {
            "identifiers": {
                "localID": "132e32467e349705de78dd1d6d5c9523",
                "clinvarLocalKey": "132e32467e349705de78dd1d6d5c9523"
            },
            "processingStatus": "Error",
            "errors": [
                {
                    "input": [
                        {
                            "field": "HGVS.hgvs",
                            "value": ":c.607dup"
                        }
                    ],
                    "output": {
                        "errors": [
                            {
                                "userMessage": "The provided sequence identifier does not have a version."
                            }
                        ]
                    }
                }
            ]
        },
        {
            "identifiers": {
                "localID": "58783f657a44698c5f65d4222198faa2",
                "clinvarLocalKey": "58783f657a44698c5f65d4222198faa2"
            },
            "processingStatus": "Error",
            "errors": [
                {
                    "input": [
                        {
                            "field": "HGVS.hgvs",
                            "value": ":c.347_349del"
                        }
                    ],
                    "output": {
                        "errors": [
                            {
                                "userMessage": "The provided sequence identifier does not have a version."
                            }
                        ]
                    }
                }
            ]
        }
    ]
}

So, two things. First, the transcript doesn't seem to have been concatenated with the c-value to form a full HGVS. I'm going to guess it is actually a bug in the preClinVar submission, but there is also a chance we have some weirdness in the Scout communication/UI, as the transcript name was retrieved remotely from VariantValidator and was then supposed to be passed on.

Secondly, and a bit en passant, I'm confused as to why the IDs differ between the submission name (SUB12692621) and the actual submission ID (SUB12692623). Could it be that the value returned from the 'validate and obtain ID' step is not properly passed on to/from preClinVar?
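If the summary report is to be shown to the user, its nested errors could be flattened into one row per message. The following is a rough sketch under the assumption that reports keep the structure quoted above; `collect_errors` is a hypothetical name, and the example dictionary is a shortened copy of the first record in that report.

```python
def collect_errors(report):
    """Yield (localID, submitted values, user message) for every reported error.

    Hypothetical helper; assumes the summary report layout quoted above.
    """
    for submission in report.get("submissions", []):
        local_id = submission.get("identifiers", {}).get("localID")
        for error in submission.get("errors", []):
            # Values ClinVar received for the offending field(s)
            values = [item.get("value") for item in error.get("input", [])]
            for detail in error.get("output", {}).get("errors", []):
                yield local_id, values, detail.get("userMessage")


# Shortened copy of the first record in the report above
example_report = {
    "submissions": [
        {
            "identifiers": {"localID": "132e32467e349705de78dd1d6d5c9523"},
            "processingStatus": "Error",
            "errors": [
                {
                    "input": [{"field": "HGVS.hgvs", "value": ":c.607dup"}],
                    "output": {
                        "errors": [
                            {
                                "userMessage": "The provided sequence identifier does not have a version."
                            }
                        ]
                    },
                }
            ],
        }
    ]
}

for local_id, values, message in collect_errors(example_report):
    print(local_id, values, message)
```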

northwestwitch commented 1 year ago

I see. This is good. It's a bug but doesn't look so hard to fix, and we also have a way to implement the status monitor in Scout. I don't know where the error is, but I'll look into it! Thanks!

dnil commented 1 year ago

I think we should move the core issue to Scout: preClinVar doesn't do much with the files. The current prod CSVs are broken: Screenshot 2023-02-02 at 09 24 05.

northwestwitch commented 1 year ago

Doesn't look like I'm collecting the ref seq here in this repo: https://github.com/Clinical-Genomics/preClinVar/blob/6a972731f9b437570904c2a1a32240c974882863/preClinVar/file_parser.py#L231

dnil commented 1 year ago

No, but you shouldn't need to - the transcript name is an integral part of the HGVS descriptor!

northwestwitch commented 1 year ago

part of the HGVS descriptor

Right!

northwestwitch commented 1 year ago

The thing is here then: clinvar_var["ref_seq"] = tx_hgvs.split(":")[0]

It's splitting the HGVS into two parts to pass them separately to preClinVar.
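To illustrate what that split does, here is a small sketch. `split_hgvs` is a hypothetical stand-in, not the Scout code, and `NM_000000.1` is a made-up placeholder accession. It uses `str.partition` so that an input without a reference sequence is reported as an empty string instead of being mistaken for one.

```python
def split_hgvs(tx_hgvs):
    """Split 'accession.version:change' into (ref_seq, change).

    Hypothetical illustration of the split discussed above.
    """
    ref_seq, separator, change = tx_hgvs.partition(":")
    if not separator:
        # No colon at all: the whole string is the change, with no reference sequence
        return "", tx_hgvs
    return ref_seq, change


# NM_000000.1 is a placeholder accession, used only for illustration
print(split_hgvs("NM_000000.1:c.607dup"))
print(split_hgvs(":c.607dup"))
```

The second call mirrors the value ClinVar reported in the error: the c. part survives, but the reference sequence comes back empty.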

northwestwitch commented 1 year ago

I'll write a fix!

northwestwitch commented 1 year ago

But wait, I'm still not convinced about this fix in Scout. The thing is, Scout should create these two submission files. In the Variant submission file, the HGVS and the ref seq are in separate columns (see schema).

image

Then the files are sent to preClinVar, which should combine the fields again and pass on a complete HGVS instead.

I think the fix should be done in preClinVar!
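A minimal sketch of what that recombination could look like, assuming the two values come from the 'Reference sequence' and 'HGVS' columns of the Variant file. `join_hgvs` is a hypothetical name rather than the actual preClinVar function, the "contains a dot" version check is a deliberate simplification, and `NM_000000.1` is a placeholder accession.

```python
def join_hgvs(ref_seq, hgvs_change):
    """Combine a versioned reference sequence and a c./g. change into a full HGVS.

    Hypothetical helper; raises ValueError instead of emitting ':c.607dup'-style values.
    """
    ref_seq = (ref_seq or "").strip()
    hgvs_change = (hgvs_change or "").strip()
    if not ref_seq:
        raise ValueError("Reference sequence is missing")
    if "." not in ref_seq:
        # Simplified check: ClinVar expects accession.version
        raise ValueError(f"Reference sequence '{ref_seq}' has no version")
    if not hgvs_change:
        raise ValueError("HGVS change is missing")
    return f"{ref_seq}:{hgvs_change}"


# NM_000000.1 is a placeholder accession, used only for illustration
print(join_hgvs("NM_000000.1", "c.607dup"))
```

Failing early like this would surface the problem at conversion time rather than in the ClinVar summary report a day later.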

northwestwitch commented 1 year ago

And I'll move this issue back to preClinVar

dnil commented 1 year ago

Maybe we should open an issue with ClinVar instead. They can't call it "HGVS" if it is only the "c. or g. portion of the nucleotide HGVS expression". 🙄

dnil commented 1 year ago

They are aware, now, it seems, but perhaps were a bit confused at one point:

HGVS expressions

Check that your HGVS expressions are valid with [VariantValidator](https://variantvalidator.org/service/validate/) or [Mutalyzer](https://mutalyzer.nl/).
On the lite spreadsheet template, enter the HGVS expression in the 'HGVS' name column.
On the full spreadsheet template, enter the accession.version number in the 'Reference sequence' column and the c./g. portion of the HGVS expression in the 'HGVS' column.
We only accept NCBI RefSeq accession numbers as the reference sequence due to technical constraints (namely, that we do not have alignment datasets for GenBank accessions).
Do not include the p. HGVS expression in these columns. It may be provided in the 'Alternate designations' column instead.
If you have information on multiple nucleotide changes that result in the same protein change, submit each nucleotide change on a separate row.
[Spreadsheets with examples of valid HGVS expressions that ClinVar accepts and invalid HGVS expressions and corresponding error messages](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/submission_examples/) are available.

northwestwitch commented 1 year ago

Yeah, it's hard to follow! 😕

dnil commented 1 year ago

Mm, the lite version is slightly newer, so I'm going to guess there was a conceptual error, or a parsing/documentation issue, in the "full" one that they kept for compatibility. You can trace someone wanting to get to the bottom of it here: https://www.ncbi.nlm.nih.gov/clinvar/docs/hgvs_types/. The API appears to use ~full HGVS including the transcript name...

dnil commented 1 year ago

And for the record, we now have a success on ClinVar!

Screenshot 2023-02-03 at 13 52 47