Establish sets of reference data

mattjmeier commented 3 years ago

For testing purposes, we need two relatively small datasets, each with corresponding metadata and contrasts files.

Need one for TempO-Seq (maybe a subset of the PFAS data published by Andrea).
Need one for RNA-Seq. A good one in terms of toxicological response is Reza's data, but the downside is that it's Ion Torrent. This doesn't particularly matter, but it might be better to use some Illumina data.

cookkate commented 3 years ago

My thoughts:

1. If we want this primarily for testing the correctness/functionality of the code, the origin of the data doesn't matter, as long as it produces the same result every time. It should be as small as possible so that tests run quickly. Ideally we should have a script to run the whole pipeline on both datasets.
2. If we want this for trialing new methods, it should reflect the majority of the data we use (presumably Illumina) and be a "realistic" size.
3. If we want this as an example dataset for users to download along with the code, it doesn't need to be large, but it should use the methods most people use (again, probably Illumina).

IMO the first criterion is the most important, and maybe the third.
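
A script that runs the whole pipeline on both datasets could look roughly like the sketch below. It assumes the pipeline is launched through the `snakemake` command line with one config file per dataset. The config paths and the function names `build_command` and `run_all` are hypothetical placeholders, not files or functions that exist in the repository.

```python
import subprocess

# Hypothetical config files, one per test dataset (placeholder paths).
TEST_CONFIGS = {
    "temposeq": "tests/temposeq/config.yaml",
    "rnaseq": "tests/rnaseq/config.yaml",
}


def build_command(config_path, cores=1):
    """Build the snakemake invocation for one test dataset."""
    return ["snakemake", "--cores", str(cores), "--configfile", config_path]


def run_all(configs=TEST_CONFIGS, runner=subprocess.run):
    """Run the pipeline once per dataset and return the names that failed."""
    failed = []
    for name, config_path in configs.items():
        result = runner(build_command(config_path))
        # A non-zero exit code means the pipeline did not finish cleanly.
        if result.returncode != 0:
            failed.append(name)
    return failed
```

Calling `run_all()` from the repository root would run each dataset in turn; an empty list means both finished without error. The `runner` argument is there only so the loop can be exercised without snakemake installed.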

mattjmeier commented 2 years ago

I think this is where we can discuss continuous integration and development of tests.

To set this up we will really need to define dependencies carefully (which is a good thing anyway).

This guide shows a good strategy. It will also help define a VM/container when that time comes. Their overall workflow can be seen here. There is more than one way to skin this cat, but this seems like a good starting point for the discussion anyway. I don't think we have to use the "tox" tool, but it provides a useful framework.

Furthermore - this example shows how R can be used within workflows. They have a rather different goal there, but we can still use some of that code to help us.

So, I think the overall tasks for the workflow to make continuous integration possible are:

[x] install R dependencies on ubuntu
[x] install python and snakemake on ubuntu
[x] install other system dependencies (Pandoc, STAR, fastp, samtools, etc.)
[x] run tests on push (makes sense to me?). I think this should happen on any branch, which seems possible using wildcards.
[ ] define tests for pre-processing TempO-Seq and RNA-Seq data separately; we want to make sure that neither is broken. Diff the count_table.csv produced by the test run against an archived copy holding the expected result; the two should be identical.
[x] define tests for checking DEG lists in the same way as above.
[x] later, we can define more tests for other outputs by running other diffs on "truth" files.
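
The "diff against a truth file" check in the list above can be sketched in a few lines of Python. This is a minimal illustration, assuming the outputs are plain CSV files that should match exactly. The helper names `read_table` and `diff_tables` are made up for this example and are not part of the pipeline.

```python
import csv


def read_table(path):
    """Load a CSV file as a list of rows (each row is a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def diff_tables(expected_path, actual_path):
    """Return a list of readable differences; an empty list means identical."""
    expected = read_table(expected_path)
    actual = read_table(actual_path)
    problems = []
    if len(expected) != len(actual):
        problems.append(
            f"row count differs: expected {len(expected)}, got {len(actual)}"
        )
    # Compare row by row over the part both files share.
    for index, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got:
            problems.append(f"row {index} differs: expected {want}, got {got}")
    return problems
```

A test would then assert that `diff_tables(archived_copy, test_copy)` returns an empty list, once for count_table.csv and once for each DEG list. An exact comparison like this only works if the pipeline output is deterministic, which matches the requirement that the data produce the same result every time.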

I'll try to start getting this done in a new branch... I don't know if this is the best strategy, but it's a start!

mattjmeier commented 2 years ago

This also seems useful

mattjmeier commented 2 years ago

And act seems like what we should use to run actions locally. Looks easy to use and should save issues with computational power.

mattjmeier commented 2 years ago

I think all we need to do to close this issue is establish a set of test data for RNA-Seq and create a separate entry in the test matrix for it.

Later we can think about adding other TempO-Seq platforms or different genomes or whatever, but I will feel good about just having the two main types of analysis tested regularly.

R-ODAF / R-ODAF_Health_Canada
