NCBI-Hackathons / Scan2CNV

MIT License

Tentative plan from the R side #7

Closed ngiangre closed 7 years ago

ngiangre commented 7 years ago

We got R to install on our server and, except for installing 'devtools', I think we're good to go on that issue.

The consensus seems to be that we need some kind of reproducible, streamlined pre- and post-processing of idat files. Manual curation is just too time-intensive, wastes money, yields FPs and FNs, and isn't scalable. Rather than re-invent the wheel, 'gsrc' seems like a good package for this. However, there might be issues scaling it to the size this global screening project needs.

What I'm hoping to do tonight is write a script, based on the vignette, that takes the .idat and .csv files on the server, processes and normalizes them, and gets at least as far as CNV calling (using default methods for now). Hopefully I'll get through assigning CNA type.
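
To make the first step concrete, here's a minimal sketch of reading raw .idat intensities with illuminaio, the low-level reader that gsrc builds on. The directory path is hypothetical, and for real work gsrc's own higher-level readers from the vignette should be preferred:

```r
# Minimal sketch, not the gsrc pipeline itself: pull raw bead
# intensities out of Illumina .idat files with illuminaio.
library(illuminaio)

# Hypothetical location of the .idat files on the server.
idat_files <- list.files("data/idats", pattern = "\\.idat$", full.names = TRUE)

# readIDAT() returns a list; $Quants is a matrix of per-bead
# summaries with "Mean", "SD", and "NBeads" columns.
raw <- lapply(idat_files, readIDAT)
intensities <- lapply(raw, function(x) x$Quants[, "Mean"])
```

Note the `lapply` over samples here: this is exactly the kind of per-sample loop that becomes the bottleneck at scale, which is why parallelizing it comes up below in the thread.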

Going forward, I think it would be good to estimate projected memory/time at scale so that we can justify changing code or writing an updated package (this would also be good for our write-up). It would also be worth noting downstream changes, e.g. methods in the existing package that should be replaced. Right now, I think turning the current 'gsrc' package into a new, updated package that is parallelizable and more efficient would be a good goal.

I think tomorrow morning we can finalize the evaluation and decide whether we use the existing 'gsrc' package as-is or make a "parallel gsrc" package.

ekarlins commented 7 years ago

Thanks @ngiangre! Any new code or ideas for new code that would accomplish these goals?

ngiangre commented 7 years ago

Yes, I'm going to write up more by tonight, but right now: parallelizing, e.g. mclapply instead of lapply, and adding code that can take advantage of multiple threads/cores. I haven't done it myself but I've seen it done before. Honestly, maybe taking apart their code to make it more efficient/speedy. I'm going to try to identify the exact bottlenecks in their code with some other R packages I recently found. Hope I can get through that tonight; I'll update with more after I hack through some more.
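
The mclapply swap is nearly mechanical. A toy sketch (the `slow_step` function is a stand-in, not anything from gsrc):

```r
library(parallel)

# Stand-in for an expensive per-sample step.
slow_step <- function(x) { Sys.sleep(0.1); x^2 }

# Serial version, as in the current gsrc code.
res_serial <- lapply(1:8, slow_step)

# Parallel drop-in: mclapply() forks worker processes (Linux/macOS;
# on Windows mc.cores must stay 1, so parLapply() would be needed there).
res_parallel <- mclapply(1:8, slow_step, mc.cores = 4)

identical(res_serial, res_parallel)  # same results, less wall time
```

The appeal is that `mclapply` keeps the same signature and return type as `lapply`, so package internals don't need restructuring, only the loop calls themselves.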

ekarlins commented 7 years ago

Sounds good @ngiangre! I look forward to hearing about how it goes!!

ngiangre commented 7 years ago

So I looked into the functions within gsrc and took a look at some time/memory estimates. I couldn't get the GSA data through the existing gsrc workflow because it needs a package that I can't install on the server right now (I think Anastasia can help install what's missing), and downloading the data files to my laptop was taking too long. So I just noted where the bottlenecks were (mostly lapply calls in many package functions) and the steps the vignette outlines with its own data. I didn't write a complete script for this because it wasn't necessary, but I worked through the vignette and recorded the time/memory commands I used; nothing that needs to be added to the repo.
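
For the record, base R already covers the kind of time/memory checks described here. A sketch, with `heavy_step` as a hypothetical stand-in for a gsrc function under test:

```r
# Stand-in for an expensive workflow step.
heavy_step <- function(n) {
  m <- matrix(rnorm(n * 100), ncol = 100)
  rowMeans(m)
}

# Wall-clock and CPU time of one step.
print(system.time(heavy_step(1e5)))

# Rough memory footprint of a large intermediate object.
print(object.size(matrix(0, 1e5, 100)), units = "MB")

# Line-level profiling across a longer run: sample the call stack,
# then summarize which functions dominate total time.
Rprof("profile.out")
invisible(heavy_step(1e5))
Rprof(NULL)
head(summaryRprof("profile.out")$by.total)
```

`summaryRprof`'s `by.total` table is usually enough to confirm whether the `lapply` loops really are where the time goes before committing to a rewrite.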

Anyway, 'gsrc' does a lot of function-level hacking when reading in files and converting file formats and annotation, but for calling CNVs it uses a different method than PennCNV's HMM, which might be good to add to our CNV-calling comparison. I could write an R script that could be added to Snakemake for that. We can make a list of CNV-calling methods later.

In any case, if we wanted a fully functioning pipeline, I think it would require modifying code in multiple packages, and that might not be as feasible as the Python approach.

slsevilla commented 7 years ago

I agree with Nick. I started bringing our data into R, and the biggest bottleneck for me was the actual data load. Bringing in the .bpm manifest (as a CSV) also took a while, but since the dictionary would only be created once, that's not as big a deal.
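
Even the one-time manifest load can be sped up; base `read.csv` is usually the slow part for files this size, and `data.table::fread` is a common drop-in. The path below is hypothetical:

```r
library(data.table)

# Multi-threaded CSV reader; returns a data.table.
manifest <- fread("manifest.csv")

# Base-R equivalent for comparison (typically much slower on big files):
# manifest <- read.csv("manifest.csv")
```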

I think a piecemeal sort of approach would work, which may make the project helpful overall.

Here's my visual for how the project would flow and where I see contributions. I left out the visualizations because those depend heavily on how we're getting the data out and what visual tools are available. https://www.gliffy.com/go/publish/11882477

ekarlins commented 7 years ago

@ngiangre and @slsevilla, it sounds like you both agree that the best way for us to get something working in the short amount of time we have is moving forward with the Snakemake workflow. I think adding functionality from "gsrc" to that workflow would be great. I'll start a separate issue for that and close this issue. Actually, I'll start a bunch of issues soon with all that I think we have on our plate to get this pipeline working. If you see other functionality that would be useful, please start separate issues for those.

@mtbrown22, I'm going to assign you some issues having to do with PennCNV. It's been a while since I've used that program so it would take some work on my end to figure it out again or find my old scripts. But definitely let me know if you need help with anything!

ngiangre commented 7 years ago

Sounds great, go team!
