Further filtering/cleaning and unification of results in iriscouch for the benchmark

brainstorm commented 10 years ago

Now that the reporting functionality works in a single thread for all programs, we should put some effort into normalizing/unifying the data reported to the CouchDB database and plot it accordingly via matplotlib or D3.

I have disabled generation of the dummy FASTQ files, since we only want simNGS-generated reads (i.e., simngs_hostORG_contamORG_numREADS.fastq) for the different species, with, for instance, a single ecoli read spiked in.
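To make the naming convention above concrete, here is a minimal sketch of a helper that builds and parses such file names. The names `SampleName`, `build_name` and `parse_name` are hypothetical (they are not part of FACS), and the sketch assumes that organism labels contain no underscores and that the name always has the three fields shown above.

```python
import re
from typing import NamedTuple


class SampleName(NamedTuple):
    """Fields encoded in a simngs_hostORG_contamORG_numREADS.fastq name."""
    host: str
    contaminant: str
    num_reads: int


# Assumption: organism labels (dm3, hg19, ecoli, ...) contain no underscores.
_PATTERN = re.compile(
    r"^simngs_(?P<host>[^_]+)_(?P<contam>[^_]+)_(?P<reads>\d+)\.fastq$"
)


def build_name(host: str, contaminant: str, num_reads: int) -> str:
    """Return the file name for a synthetic sample."""
    return f"simngs_{host}_{contaminant}_{num_reads}.fastq"


def parse_name(filename: str) -> SampleName:
    """Recover host, contaminant and read count from a file name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a simngs sample name: {filename!r}")
    return SampleName(match["host"], match["contam"], int(match["reads"]))
```

For example, `parse_name(build_name("dm3", "ecoli", 100))` gives back the three fields, which could later be used to group results by organism when plotting.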

@guillermo-carrasco, @b97pla: let me know if this sketch needs more clarification; I will be happy to adapt the code as needed.

guillermo-carrasco commented 10 years ago

Hi @brainstorm,

The schema looks good. Just one thing is not clear to me: if the resulting FASTQ file is going to be named simngs_hostORG_contamORG_numREADS.fastq, I guess this means you're going to spike a read of contamORG, which will not always be E. coli, right?

I would rather plot with matplotlib than D3, as I have some experience with matplotlib but none with D3. We can talk about when and how you are going to run the tests, so I can get going with the plotting scripts.

Thanks Roman!

brainstorm commented 10 years ago

Yeah, it definitely would use some generalization, but let's stick with ecoli for now, we already have quite a few moving parts in the tests, I prefer a bit of convention over configuration for now :-/

arvestad commented 10 years ago

Can you please explain what you are trying to do here?

(Replying by email to Roman Valls Guimerà's comment of 28 Oct 2013, 10:59.)

brainstorm commented 10 years ago

@arvestad, the idea is to generate N synthetic reads from each supported organism (dm3, hg19, ecoli) and spike 1 single read there to see how and if the different decontamination programs detect it.

The way we do it now is to generate 100 reads of dm3 and spike a single ecoli read in there. Then we run deconseq, facs and fastq_screen against that and see how it goes.

@guillermo-carrasco was suggesting generalizing the spike part, so instead of going for just ecoli, we could spike an arbitrary read from any organism.

When it comes to benchmarking, for now, I think it makes sense to assume just a single read of ecoli and see how it goes. Then, refactor the tests so that they can spike anything.

Hope that clarifies the issue.
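As a rough illustration of the spiking step described above, the following sketch inserts one contaminant record into a list of host records at a random but reproducible position. It is not the code used in the FACS tests: `read_fastq` and `spike` are hypothetical names, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import random


def read_fastq(lines):
    """Group FASTQ lines into (header, sequence, '+', quality) tuples.

    Assumes each record spans exactly four lines.
    """
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    return [tuple(cleaned[i:i + 4]) for i in range(0, len(cleaned), 4)]


def spike(host_records, contaminant_record, seed=0):
    """Return a copy of host_records with one contaminant record inserted.

    The insertion position is drawn from a seeded generator, so the same
    seed always yields the same spiked sample.
    """
    rng = random.Random(seed)
    spiked = list(host_records)
    position = rng.randint(0, len(spiked))  # inclusive: may append at the end
    spiked.insert(position, contaminant_record)
    return spiked, position
```

With 100 dm3 records and one ecoli record this yields 101 records, and the returned position tells the benchmark which record the decontamination programs are expected to flag.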

arvestad commented 10 years ago

What I mean is: is your plan a fully reproducible paper, so you plan to have the output become a part of the paper (which is slowly moving forward, I promise) or is this part of the general test set?

(Replying by email to Roman Valls Guimerà's comment of 28 Oct 2013, 13:02.)

brainstorm commented 10 years ago

Well, both actually... The final output from these tests will be plots, so yes, that will end up in the paper and be reproducible by capable hands.

The "intermediate" output is that all the tests within FACS run successfully with the fetched test sets.

tzcoolman commented 10 years ago

@brainstorm @arvestad

"generating 100 reads of dm3 and spike a single ecoli read in there"

Will this single ecoli read be generated by simNGS as well? Have you thought about controlling for random errors? What I mean is: if you have 101 reads generated by simNGS (100 dm3 reads and 1 ecoli read), use the ecoli genome as the reference, and try to capture the single ecoli read, the result could be very random because of the random errors that simNGS puts into the ecoli read. In that case anything could happen, and I guess the result would not be convincing.
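To illustrate the concern about random errors, here is a small sketch that mutates the same read under different random seeds and counts the mismatches against the original. The uniform per-base substitution model is an assumption made purely for illustration; it is not simNGS's actual error model, and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def add_substitutions(read, error_rate, rng):
    """Replace each base with a different one with probability error_rate."""
    mutated = []
    for base in read:
        if rng.random() < error_rate:
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_spread(read, error_rate, seeds):
    """Mismatch count of the mutated read for each seed.

    A wide spread means that whether a single spiked read is detected
    may depend on the seed rather than on the tool being benchmarked.
    """
    return [
        mismatches(read, add_substitutions(read, error_rate, random.Random(s)))
        for s in seeds
    ]
```

Calling `mismatch_spread(read, 0.05, range(10))` on a 100 bp read shows how much a single read can vary from run to run; with only one spiked read per sample, that variation goes straight into the benchmark result unless the seed is fixed or the run is repeated.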

brainstorm commented 10 years ago

@tzcoolman, nope, but patches and pull requests are welcome ;)

guillermo-carrasco commented 10 years ago

Actually, now that I read this again, the Ecoli read that we plan to spike in is not generated by SimNGS. It is the one in helpers.py (https://github.com/guillermo-carrasco/facs/blob/master/facs/utils/helpers.py#L16), so there shouldn't be any difference in the results, right?

brainstorm commented 10 years ago

It depends on the other generated reads... If they happen to be similar to the Ecoli one, then no (quite unlikely though).

(Replying by email to Guillermo Carrasco's comment of 6 Nov 2013, 09:16.)

guillermo-carrasco commented 10 years ago

Hmm ok, one should be able to parametrise SimNGS so that it does not introduce any random error. Would that be a better solution than spiking the same Ecoli read everywhere? I think it would amount to much the same, and it is something that is expected (maybe the second one is a more elegant solution, though).

brainstorm commented 10 years ago

We have preliminary numbers on the following PR:

https://github.com/guillermo-carrasco/facs/pull/1

With relevant code here (filtering out merge commits):

https://github.com/brainstorm/facs/blob/8a5ccd5dd283dcf171cdbfeb5d896973da54ba8c/facs/utils/performance.py

Next up: plotting :)

brainstorm commented 10 years ago

[x] Define different read sizes in simNGS generation (e.g., simngs_phiX_100.fastq, then 1000 and 10000).
[x] Purge reports database (http://facs.iriscouch.com/_utils/), results are tainted by development one-off tests.
[x] Re-run make benchmarks with current test code.
[x] Generate a fixed and predictable amount of reads at simNGS output (see http://www.biostars.org/p/8752/#88128).
[x] Use a static seed when calling simLibrary and simNGS for reproducibility (-seed parameter on both programs).
[x] CouchDB views for each organism.
[x] Have FACS report full paths to samples (... facs/tests/data/synthetic_fastq/simngs_phiX_100.fastq instead of the basename).
[x] Plot!
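For the "CouchDB views for each organism" item, a design document along these lines could work. This is only a sketch under assumptions: the document field `sample` (holding the sample path), the design document id `_design/organisms` and the helper names are all hypothetical, and the actual report schema in the database may differ. The map functions are plain JavaScript strings, as CouchDB expects.

```python
import json

# Organism labels mentioned in this thread.
ORGANISMS = ["dm3", "hg19", "ecoli", "phiX"]


def organism_view(organism, field="sample"):
    """Build a CouchDB view whose map function emits reports for one organism.

    Assumption: each report document stores the sample path in `field`.
    Matching is a plain substring test, so a spiked sample whose name
    contains two organism labels shows up in both views.
    """
    needle = json.dumps(organism)  # quoted JavaScript string literal
    map_fn = (
        "function (doc) {\n"
        f"  if (doc.{field} && doc.{field}.indexOf({needle}) !== -1) {{\n"
        f"    emit(doc.{field}, null);\n"
        "  }\n"
        "}"
    )
    return {"map": map_fn}


def design_document(organisms=ORGANISMS):
    """Assemble one design document with a view per organism."""
    return {
        "_id": "_design/organisms",
        "language": "javascript",
        "views": {org: organism_view(org) for org in organisms},
    }
```

Dumping `design_document()` with `json.dumps` gives the JSON body that would be stored in the reports database; each view could then be queried separately when plotting per-organism results.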

brainstorm commented 10 years ago

Automatic plots implemented in the ipython notebook!
