snakemake workflow for effective genome sizes

daler commented 8 years ago

I wanted to use epic on mm10 and dm6 assemblies, and I saw that in config/genomes.py you are storing dictionaries of chromsizes and effective genome sizes for human. Since I at least had to do these two other genomes, I figured I'd build something to support a wider range of genomes and read lengths.

It's a snakemake workflow to download sequences and chromsizes and calculate effective genome sizes for any genomes supported by UCSC. I put it in the scripts dir for now, but I was thinking it would be convenient to add the chromsizes dir and effective_sizes dir to the package data so the data are installed. Then instead of config/genomes.py checking the lookup dicts, it can read the chromsizes files and effective genome size files. Supporting a new genome would then just mean adding the assembly name to the snakefile and running it, then re-running setup.py sdist.

landscape-bot commented 8 years ago

Repository health decreased by 2% when pulling b1edb90 on daler:master into 22943ca on endrebak:master.

11 new problems were found (including 0 errors and 4 code smells).
1 problem was fixed (including 0 errors and 1 code smell).

coveralls commented 8 years ago

Coverage remained the same at 17.523% when pulling b1edb90d6a7bbbf934f7bca546e40ea0e7087f70 on daler:master into 22943cae5d26273f17951bb400c6c27ff77d27ca on endrebak:master.

endrebak commented 8 years ago

Nice work. I'll look into it tonight. I think your idea sounds better, when adding many genomes, reading them all into a dict might take a sec of startup-time.

1) What do you think of allowing users to enter their own egs on the CL? The jellyfish approach is a bit strict (no mismatches allowed) so the egs becomes somewhat too high and users might know a value that suits their data better.

2) And what do you think of allowing users to enter a fasta on their command line and having epic compute the egs and chrom sizes? I think this is suboptimal, since a user should rather submit the size/egs data to me so it can be added to epic once and for all. But this might have a use case.

Btw, I do not know how to use git collaboratively, but I think that it would be better if you were able to pull my recent changes and add your changes on top of them ("stashing"/"popping"?) in a feature branch. Some regressions are added if you see the landscape-bot.

daler commented 8 years ago

I wasn't worried so much about startup time, just figured it would be one less step to copy the results into a dictionary when supporting a new genome.

It would definitely be useful to specify egs and chromsizes on the CL. That would eliminate the need for genomes.py, but could still fall back on it for defaults. I was actually starting another PR for that but will hold off for now.

I don't think you need anything in the CLI for egs on top of the existing egs script. It took a lot of RAM to run jellyfish, so it would be good to keep it separate as a sort of infrequently-used helper rather than integrated into the tool proper. Plus the jellyfish dependencies get annoying on OSX, which is another reason to keep separate.

Since there are no conflicts, merging this PR will merge these changes into the changes you already have. I'll merge the master branch into this anyway though.

landscape-bot commented 8 years ago

Code quality remained the same when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

coveralls commented 8 years ago

Coverage remained the same at 17.523% when pulling 79ccc13ce33bfe6dc161dae86b6abdfe8dfa5fa1 on daler:master into 22943cae5d26273f17951bb400c6c27ff77d27ca on endrebak:master.

endrebak commented 8 years ago

Thanks. I would not mind making this an org-repo if that makes things easier for you or you see some other merit in it.

coveralls commented 8 years ago

Coverage remained the same at 17.523% when pulling 79ccc13ce33bfe6dc161dae86b6abdfe8dfa5fa1 on daler:master into 22943cae5d26273f17951bb400c6c27ff77d27ca on endrebak:master.

endrebak commented 8 years ago

One question: for epic to look up the egs, it needs to know the read length. How did you intend to get the read length in this system*? Compute them from the input file or ask the users to give a length on the command line?

*Is estimating the read length from the first 10-1000(?) reads okay?

daler commented 8 years ago

(not sure an org-repo is needed yet)

For read length, how about figure it out from the first 1000 reads, select the closest read length that has has a stored calculated egs, and emit a log message saying what read length was estimated. If the user notices that the read length wasn't correct or is too far off, they can supply an override on the command line.

endrebak commented 8 years ago

I thought something along those lines so then we are in agreement. Thanks.

endrebak commented 8 years ago

Something funny must have happened when you computed the hg38 genome sizes:

epic/scripts/effective_sizes/hg38_36.txt:Effective genome size:  1.35161359177
epic/scripts/effective_sizes/hg38_50.txt:Effective genome size:  1.42136215844
epic/scripts/effective_sizes/hg38_75.txt:Effective genome size:  1.48818043551
epic/scripts/effective_sizes/hg38_100.txt:Effective genome size:  0.244824913134

The problem is most likely that my code, for some reason, calculated that the length of the genome was 1889563403. Will recompute the length here to understand what went wrong.

Can't reproduce here, so I am guessing something went wrong with your download or conversion to fa :)

Thanks again for the script.

daler commented 8 years ago

Thanks, I will re-run too when I get a chance. I seem to remember having memory issues on cluster nodes with this, maybe some truncation was happening. I'll take a closer look.

endrebak commented 8 years ago

No need, we have plenty of ram here. I am running it right now.

On Tuesday, June 28, 2016, Ryan Dale notifications@github.com wrote:

Thanks, I will re-run too when I get a chance. I seem to remember having memory issues on cluster nodes with this, maybe some truncation was happening. I'll take a closer look.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/endrebak/epic/pull/3#issuecomment-229053390, or mute the thread https://github.com/notifications/unsubscribe/AQ9I0vJOBv1-FCIAXrfmX74--nP9ATNVks5qQSWkgaJpZM4Iaa5g .

biocore-ntnu / epic

snakemake workflow for effective genome sizes #3