BIMSBbioinfo / pigx_sars-cov-2

PiGx SARS-CoV-2 wastewater sequencing pipeline
GNU General Public License v3.0

Deconvolution step failing with custom mutation sheet #155

Closed charlesfoster closed 1 year ago

charlesfoster commented 1 year ago

Hi,

We routinely carry out wastewater analysis, so I thought I'd do my due diligence and check out other pipelines to compare results. I've successfully run pigx-sars-cov-2 on the test data, and on real data using the same test mutation sheet (mutation_sheet_211006_covidCG_NT_location.csv) and bed file (Covid_CG_NTmutation_t07.bed). However, when I run the same data but using a custom set of mutations, the pipeline errors out.

I assume the error might be related to the mutation sheet csv file itself. In the logfile for the deconvolution command I get an odd error:

DECONVOLUTION WITH RLM
Error in { : 
  task 1 failed - "'x' is singular: singular fits are not implemented in 'rlm'"
Calls: deconvolute -> %dopar% -> <Anonymous>
In addition: Warning message:
executing %dopar% sequentially: no parallel backend registered 
Execution halted

I'm not quite sure what the problem with the input file is. I've attached it here: mutations.uniq.csv. Basically, I created it using usher_barcodes, followed by some wrangling in Python to get it into the format needed by your pipeline. (Note: I'm not sure whether this is a good mutation sheet to use; I'm just testing.)

If I switch the deconvolution method to 'weighted_svr', the analysis does not seem to error out (it's still running). However, I'm not sure of the relative merits of the different deconvolution methods.

Any thoughts?

Thanks

edit: now that the analysis has finished, I can see that using the 'usher_barcodes' isn't great, presumably because many lineages share the same mutations due to the hierarchical nature of Pango lineages. Do you have any other tips for setting up the mutation sheet? I tried downloading mutation lists for lineages of interest from outbreak.info using their API, but the resulting mutations are given as amino acid changes (e.g., S:D614G) rather than as nucleotide coordinates.

jonasfreimuth commented 1 year ago

Hello,

Thanks for reporting your issue; it's nice to see people playing around with the pipeline.

I believe the reason (weighted_)svr works and rlm doesn't is the edge case of a singular fit for the data. As the error says, this is something rlm (i.e. robust linear regression; see the corresponding function in the MASS package) cannot deal with, whereas svr (support vector regression, implemented via the tune function in the e1071 package) doesn't have that restriction.
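
If it helps to see the difference concretely, here is a minimal R sketch of that behaviour. It is not the pipeline's code, and it calls svm directly rather than going through tune, but it shows how a rank-deficient design matrix trips up MASS::rlm while e1071's support vector regression still fits:

library(MASS)   # rlm: robust linear regression
library(e1071)  # svm: support vector regression

# Two "variants" with identical mutation signatures make the columns
# collinear, so the design matrix is singular (rank deficient).
sig <- data.frame(
  variant_a = c(1, 0, 1, 0, 1),
  variant_b = c(1, 0, 1, 0, 1),  # duplicate signature
  freq      = c(0.90, 0.10, 0.80, 0.05, 0.85)
)

# rlm refuses singular fits:
try(rlm(freq ~ variant_a + variant_b, data = sig))
# Error: 'x' is singular: singular fits are not implemented in 'rlm'

# Support vector regression is kernel-based and does not require a
# full-rank design matrix, so it fits anyway:
fit <- svm(freq ~ variant_a + variant_b, data = sig)
predict(fit)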

Could you please clarify whether the error was produced on our test data or your own? I am having trouble reproducing it on the test data. With the mutation sheet you provided I instead run into a dimensionality issue, where there are too few detected mutations for all the variants you are trying to quantify. The pipeline attempts to mitigate this by collapsing/deduplicating variants with the same signature, but this can lead to problems of its own when few mutations are found within a sample. Perhaps this is related, but again, I can't reproduce your exact error. For future reference, the error you would get in that case would look something like this:

Error in { : task 1 failed - "0 (non-NA) cases"
Calls: deconvolute -> %dopar% -> <Anonymous>
In addition: Warning message:
executing %dopar% sequentially: no parallel backend registered 
Execution halted

As for tips for creating the mutation sheet: as you already noticed, the most important thing is the uniqueness of the mutations. Other than that it's hard to say; I don't think we have tested the pipeline on this many variants before.
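
If it helps, here is a rough sketch for spotting mutations that are shared between lineages in a sheet like yours. It assumes a wide layout with one column per lineage and one nucleotide mutation per cell, which may not match your exact format, so treat it as a starting point rather than a ready-made check:

# Hypothetical check, assuming a wide sheet: one column per lineage,
# each cell a nucleotide mutation string (e.g. "A23403G").
sheet  <- read.csv("mutations.uniq.csv", stringsAsFactors = FALSE)

muts   <- unlist(sheet, use.names = FALSE)
muts   <- muts[!is.na(muts) & nzchar(muts)]      # drop empty cells
counts <- sort(table(muts), decreasing = TRUE)

# Mutations listed under more than one lineage are not discriminating on
# their own; the more of these there are, the harder the deconvolution.
print(counts[counts > 1])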

charlesfoster commented 1 year ago

Apologies, I should have been clearer there. The error I reported was with my own real data.

I just ran the pipeline on the sample data using:

I didn't get the error from my original post, but the pipeline errored out with:

Error in do.call(cbind, lapply(x, is.na)) : 
  variable names are limited to 10000 bytes
Calls: dedupe_sigmut_mat ... [<-.data.frame -> is.na -> is.na.data.frame -> do.call
Execution halted

I was running it from ~/Programs/pigx-ww, with the same subdirectory structure as the GitHub repo. I then changed directory to ~/Programs/pigx-ww/sarscov2-test/tests and re-ran the same command, and this time I didn't get any errors about variable names. It seems like there might be a problem when paths are too long.

However, I did get the other error you described:

DECONVOLUTION WITH RLM
Error in { : task 1 failed - "0 (non-NA) cases"
Calls: deconvolute -> %dopar% -> <Anonymous>
In addition: Warning message:
executing %dopar% sequentially: no parallel backend registered 
Execution halted

Clearly for further comparisons I'll need to choose a curated set of diagnostic mutations for fewer lineages :)

jonasfreimuth commented 1 year ago

Yes, the error "variable names are limited to 10000 bytes" is due to too many variants being collapsed into one column name during deduplication of variants with the same mutation signature within that sample. That is fixable, though. As a matter of fact, I ran into the same problem yesterday and I think I have fixed it; I will open a PR and link it here once I have run the tests on it.
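
To illustrate the general idea with a toy example (this is just a sketch of the gist, not what dedupe_sigmut_mat actually does):

# Toy sketch: variants with identical signatures in a sample are merged into
# one column, and the merged column's name concatenates all variant names.
# With many indistinguishable lineages, that name can eventually exceed R's
# 10000-byte limit on variable names.
sigmat <- data.frame(
  BA.1   = c(1, 0, 1),
  BA.1.1 = c(1, 0, 1),   # same signature as BA.1 in this sample
  BA.2   = c(0, 1, 1)
)

key     <- apply(sigmat, 2, paste, collapse = "")   # signature string per variant
deduped <- sigmat[, !duplicated(key), drop = FALSE]
colnames(deduped) <- tapply(colnames(sigmat), key, paste, collapse = ",")[key[!duplicated(key)]]

print(deduped)  # columns: "BA.1,BA.1.1" and "BA.2"; the first name grows with every merged variant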