HumanCellAtlas / secondary-analysis

Secondary Analysis Service of the Human Cell Atlas Data Coordination Platform
https://pipelines.data.humancellatlas.org/ui/
BSD 3-Clause "New" or "Revised" License

Update PR test data for Optimus #706

Closed kbergin closed 5 years ago

kbergin commented 5 years ago

Created from checklist item in [GL-409]

From the ticket that made us realize the tests aren't working as expected: We were working on fixing the bug while Kishori was out of office and ran into some confusion. We identified that the PR merging in the Gene ID change had passing Optimus tests. Ambrose was under the impression that the PR changed both the BAM outputs and the count matrix to contain gene IDs instead of gene names. This could theoretically change the scientific outputs of the pipeline (because gene names and IDs don't map one-to-one), yet the pipeline test passed with no changes to the checksums. This is odd: if the BAM and count matrix now contain IDs for the GE tags instead of names, the files are inherently different even if the counting results are the same. This makes us worried that the test is not functioning properly.

AC:

From Polina’s ticket:

Why

Today, while updating the PR tests’ dependencies so they wouldn’t crash on my PR, I went to update the scientific tests’ dependencies only to notice that several tasks were not listed there. These seem like fairly important tasks, such as Zarr Utils. This looks like it may be an issue and should be looked into, as it may already be causing, or may in the future cause, scientific tests to fail.

Where to start

skylab/test/optimus/scientific/dependencies.json

How do we know it’s done

Either the dependencies are nearly identical or we have reasoning why they are not.
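The "nearly identical" criterion above can be checked mechanically by diffing the two dependency manifests. A minimal sketch, assuming each dependencies.json is a flat JSON list of task file paths (the actual layout of skylab's dependency files may differ):

```python
import json

def diff_dependencies(pr_path, sci_path):
    """Compare two dependencies.json manifests and report tasks present in
    one but not the other (e.g. a Zarr Utils task missing from the
    scientific tests). Assumes each file is a JSON list of task paths."""
    with open(pr_path) as f:
        pr_deps = set(json.load(f))
    with open(sci_path) as f:
        sci_deps = set(json.load(f))
    return {
        "missing_from_scientific": sorted(pr_deps - sci_deps),
        "missing_from_pr": sorted(sci_deps - pr_deps),
    }
```

If both lists come back empty, the dependencies are identical; anything else is the "reasoning why they are not" that the acceptance criterion asks for.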

┆Issue is synchronized with this Jira Dev Task

kbergin commented 5 years ago

➤ Nick Barkas commented:

Datasets:

Input fastq with reads that are expected to align in the first 40Mb of chr21. This is derived from the 10X 4k PBMC dataset: https://console.cloud.google.com/storage/browser/hca-dcp-sc-pipelines-test-data/smallDatasets/chemistry_X10_V2/pbmc4k_chr21_40Mb/?project=broad-dsde-mint-dev

Alternative bigger dataset with the entire chr21: https://console.cloud.google.com/storage/browser/hca-dcp-sc-pipelines-test-data/smallDatasets/chemistry_X10_V2/pbmc4k_chr21/?project=broad-dsde-mint-dev

I would suggest trying both, because with the recent changes to accelerate the pipeline even the second one might be fast enough for a routine run.

kbergin commented 5 years ago

➤ Nick Barkas commented:

I’m working on the nb-gl416-optimus-tests branch (PR: https://github.com/HumanCellAtlas/skylab/pull/234).

I need to note that the scientific tests are not used at all (and are very out of date), and their presence is very misleading ([~shpilker] found the issue of mismatching references). This is not something we should fix in this ticket, but we need to discuss the need for scientific tests and how we run them.

I’ve started by working on the problem that not all output files are tested, and I have added md5 checksum tests for them as well. This would have picked up the problem of changed output gene names highlighted above.
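The added checks boil down to hashing every pipeline output and comparing against recorded values. A minimal sketch of that logic (the expected-hash mapping and file names are illustrative, not the actual test config):

```python
import hashlib

def md5_of(path):
    """Stream a file through md5 in chunks so large BAMs never need to
    fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_outputs(expected):
    """expected: {path: md5 hex string}. Returns (path, matched) pairs so
    a checker can report each output file individually."""
    return [(path, md5_of(path) == md5) for path, md5 in expected.items()]
```

Any output whose contents change (for example gene names swapped for gene IDs) would flip its md5 and fail the corresponding check.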

The only thing that is currently not tested is emptyDrops, as testing its output doesn’t make sense (it’s not scientifically valid and is also stochastic). I will tackle this last.

The next problem is that the input dataset is not suitable (as highlighted in the title). I have consulted with other team members, and it seems an upper limit on acceptable testing time is around 20 minutes. I will try different datasets (including the ones listed above) to see their runtimes. If none of these work, I will use the pipeline that allows selecting given intervals to generate intervals from the gene annotation (so we only process reads from genic regions); that will be both compact and fast to run.

kbergin commented 5 years ago

➤ Nick Barkas commented:

I’ve updated the references to match the integration tests. Even with the current datasets this now gives a non-empty output matrix, so I have updated the hashes. I also updated the small pipeline that subsets a fastq on the basis of a bam by location so that it accepts a bed file. This is in this PR:

https://github.com/HumanCellAtlas/skylab/pull/235
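Conceptually, the subsetting pipeline collects the names of reads that align inside the BED regions and then keeps only those records from the FASTQ. A rough pure-Python sketch of that filtering step, assuming SAM-format text input (the real WDL in the PR above presumably shells out to standard tooling; all names here are illustrative):

```python
def reads_in_regions(sam_lines, bed_regions):
    """Collect names of reads whose alignment start falls inside any BED
    region. bed_regions: iterable of (chrom, start, end), 0-based
    half-open intervals as in the BED format."""
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, chrom = fields[0], fields[2]
        pos = int(fields[3]) - 1  # SAM POS is 1-based
        if any(chrom == c and s <= pos < e for c, s, e in bed_regions):
            keep.add(name)
    return keep

def subset_fastq(fastq_lines, names):
    """Yield the 4-line FASTQ records whose read name is in `names`."""
    rec = []
    for line in fastq_lines:
        rec.append(line)
        if len(rec) == 4:
            if rec[0][1:].split()[0].rstrip("\n") in names:
                yield from rec
            rec = []
```

Restricting the regions to genic intervals from the annotation, as described earlier in the thread, would make the resulting test dataset both compact and fast to align.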

I have used this to run a dataset where I only pick up gene reads from chromosome 1. This is currently running here:

https://job-manager.caas-prod.broadinstitute.org/jobs/0883bb08-3504-482a-94cb-670e1069aadc

kbergin commented 5 years ago

That’s awesome news! Funny that it was the reference all along.

kbergin commented 5 years ago

➤ Nick Barkas commented:

It is the reference + the data. The reference was only chr21, and the data was 200k random reads, so to see any counts we had to hit a gene in chr21 within those 200k reads, which was very unlikely.

kbergin commented 5 years ago

➤ Nick Barkas commented:

The run I submitted for subsetting datasets didn’t work because that WDL had issues with the memory configuration in the sam sorting step. I fixed it and resubmitted with caching on.

https://job-manager.caas-prod.broadinstitute.org/jobs/d3d23dec-df6c-4fa9-b9bb-2418d9daf427?q=id%3Dd3d23dec-df6c-4fa9-b9bb-2418d9daf427

kbergin commented 5 years ago

➤ Nick Barkas commented:

It’s fixed, but the test still takes too long. I’ve halved the input dataset to only one of the input files and am re-running.

kbergin commented 5 years ago

➤ Nick Barkas commented:

After consulting with Kylee, I have refactored the Optimus PR tests. I converted the ‘checker’ into a workflow and split the individual checks into separate WDL tasks. This was necessary because doing proper checks on the matrix requires a different image than checking the BAM, and a single image containing both would be too big. This way we can also re-use existing images instead of building new ones.

In the process I have also added what is hopefully a more stable BAM checksum (I have kept the old one for now). The stable checksum only looks at primary reads and removes all the optional tag fields (which vary due to UMI-tools).
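The idea of the stable checksum is to hash only the deterministic part of each record: drop secondary/supplementary alignments and the optional tag columns that UMI-tools can perturb. A sketch of that normalization over SAM-format text (the real task presumably uses samtools-style tooling; this is illustrative only):

```python
import hashlib

SECONDARY = 0x100       # SAM FLAG bit for secondary alignments
SUPPLEMENTARY = 0x800   # SAM FLAG bit for supplementary alignments

def stable_bam_checksum(sam_lines):
    """Hash only primary alignment records, keeping the 11 mandatory SAM
    columns and discarding optional tags, which vary between UMI-tools
    runs even when the alignments themselves are identical."""
    h = hashlib.md5()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header @PG lines change with every pipeline run
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & (SECONDARY | SUPPLEMENTARY):
            continue  # skip non-primary records
        h.update(("\t".join(fields[:11]) + "\n").encode())
    return h.hexdigest()
```

Two BAMs that differ only in optional tags or secondary alignments would hash identically under this scheme, while any change to the primary alignments (including a GE tag switching from gene names to gene IDs, if it were stored in a mandatory column) would still be caught.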

Splitting the tasks into separate steps also lets us use smaller images (ubuntu:18.04) and reduce the memory footprint, and I have done both of these things.

The next step is to write a matrix test that reads in the matrix and compares the actual contents rather than an md5 checksum.
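Comparing parsed contents rather than bytes means benign serialization differences (field order, compression level) cannot flip the test, while real count changes still do. A minimal pure-Python sketch, assuming the matrix can be loaded as row/column labels plus rows of floats (the actual on-disk format, e.g. Zarr, is abstracted away here):

```python
def matrices_equal(a, b, tol=1e-8):
    """a, b: (row_labels, col_labels, values) triples, where values is a
    list of rows of floats. Returns True when the labels match exactly and
    every count agrees within tol."""
    a_rows, a_cols, a_vals = a
    b_rows, b_cols, b_vals = b
    if a_rows != b_rows or a_cols != b_cols:
        return False  # gene/cell labels differ -> scientifically different
    return all(
        abs(x - y) <= tol
        for row_a, row_b in zip(a_vals, b_vals)
        for x, y in zip(row_a, row_b)
    )
```

Note that a label comparison like this is exactly what the checksum test missed: a matrix whose gene names were replaced by gene IDs would fail the label check even if every count were unchanged.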

kbergin commented 5 years ago

➤ Nick Barkas commented:

I have updated the tests and they are now at a stable point, where they all pass and check different outputs individually.

Still have to do the following:

kbergin commented 5 years ago

➤ Nick Barkas commented:

kbergin commented 5 years ago

Great thank you for the updates!!

kbergin commented 5 years ago

➤ Nick Barkas commented:

This is done; waiting for CB review.

kbergin commented 5 years ago

➤ Nick Barkas commented:

Jessica helped put a small PR in skylab master that disables the ‘build’ step and allows this PR to be merged.

CB approval OK

Portability tests are currently failing due to an AWS issue (confirmed with Marcus). Will re-run over the weekend to see if it’s transient.

I pinged Rex and Saman in case it affects them, because of the extensive changes to the tests.

Will merge Monday to avoid things being broken over the weekend if something goes wrong.

kbergin commented 5 years ago

➤ Nick Barkas commented:

QA instructions: Break Optimus in a way that changes the results (e.g. change the input dataset) and make sure the tests fail.

kbergin commented 5 years ago

➤ Jessica Way commented:

Changed the values for "expected_bam_hash", "expected_cell_metric_hash", and "expected_gene_metric_hash" to nonsense. Tests failed as expected. Failing test here: https://job-manager.caas-prod.broadinstitute.org/jobs/bfea83d1-19e9-4d53-8dc9-3a862f82e0a0?q=id%3Dbfea83d1-19e9-4d53-8dc9-3a862f82e0a0

Note that the error message reads:

Bam Validation: FAIL
Metrics Validation: FAIL --- Ignoring failed test ---
Matrix Validation: PASS

which is the expected output. (Metrics files are ignored because the output is different each time.)
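The output above suggests the checker workflow runs each validation independently and only some failures gate the overall result. A hypothetical sketch of that aggregation logic (function and check names are invented for illustration, not taken from the actual WDL):

```python
def summarize(results, ignored=("Metrics",)):
    """results: {check_name: passed_bool}. Checks listed in `ignored` are
    reported but do not fail the run (e.g. metrics files differ on every
    execution). Returns (report_text, overall_ok)."""
    lines, ok = [], True
    for name, passed in results.items():
        lines.append(f"{name} Validation: {'PASS' if passed else 'FAIL'}")
        if not passed:
            if name in ignored:
                lines.append("--- Ignoring failed test ---")
            else:
                ok = False
    return "\n".join(lines), ok
```

With this structure, nonsense expected hashes fail the BAM and metrics checks individually, but only the non-ignored failures (like the BAM check) cause the overall test run to fail.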