ctb opened 11 years ago
Quick question: Does the test_extract_partitions() not correspond to the current functionality of extract_partitions.py?
Working on this here: https://github.com/RamRS/khmer/tree/extrpart-tweak
First step, gonna walk through the code and add comments - this helps me understand the code. Second, gonna add argparse usage. Third, check algorithm consistency with the current test. Fourth, perform necessary updates.
Does this look okay?
sounds good.
C. Titus Brown, ctb@msu.edu
It would really help (save me a lot of time) if I could see the output of the 'head SRR492065.pe.kak.qc.fq.gz.part' command from here: https://khmer-protocols.readthedocs.org/en/v0.8/metagenomics/3-partition.html
Do you have a sample output? Otherwise I'd have to work through to Step 3...
Mmh, not handy. Why not use the file that gets produced in the test?
random-20-a.fa.part
Just run the commands yourself, instead of using the test.
@ctb is this done?
Nope.
This is a pretty old issue. Refactoring was done by @mr-c long back, and IIRC test coverage is 99% too. My (closed) PR on this: https://github.com/ged-lab/khmer/pull/325
Ram
Yes, high test coverage is great -- thanks! However, the main body of code is still spaghetti code and needs to be refactored. This can be done much more easily with high test coverage, but still needs to be done. We can close this issue and open a new one focused just on refactoring if you think it's a good idea?
Modularization would be amazing, but from the perspective of the entire project, what would be the increase in reusability - is any code/logic in this file used anywhere else in the project?
I agree, we could close this issue and open one focused on refactoring. IMHO, refactoring will need more discussion than coding.
The code right now is very difficult to understand, which means that there may well be bugs lurking in there. Worse, it means that if we need to change it to add future functionality, we are likely to introduce new bugs. Spaghetti code is an example of a bad code smell:
http://en.wikipedia.org/wiki/Code_smell
and we seek to eradicate that wherever it may lie. Not a terribly urgent issue, though.
More discussion than testing, you mean? Refactoring is part of coding :)
What I mean to say is, in my limited vision of the code, modularity would just make the 100 LoC right now into a set of 5 methods with ~20 LoC each, not necessarily making each method reusable. It would look good and be more flexible to future changes, but no immediate gain otherwise. Pretty sure I'm being a bit too nitpicky.
And I was referring to this email discussion of how to refactor (extract methods, introduce modularity and reusability) where "coding" is the actual implementation, which in this case should not take more than an hour.
Either way, I can take this up.
Improving the code base IS gain. But as you say it's not urgent.
Hey @RamRS, If you want to claim this, please create a new PR and reference this issue. Thanks!
Alrighty. I'll open a WIP PR.
P.S: I was handling this earlier, so I thought I'd continue. I hope you are OK with that. If you are not, please do let me know.
Fine by me! But you can also ignore it and do something new :)
No longer claimed.
I can take this on if extract-partitions still needs a refactor.
please!
Looking at the CI coverage ( http://ci.oxli.org/job/khmer-master/418/label=linux/cobertura/_default_/scripts_extract_partitions_py/ ) I'm seeing near-complete coverage on extract-partitions, so I'm just going to go ahead with the refactor.
For testing on this kind of thing, I would like to ask that (prior to merge) we:
- identify a few different data sets to run through do-partition etc. (the ones in data/ are probably ok)
- run extract-partitions, old version
- run extract-partitions, new version
- make sure results are bytewise identical, OR that we understand exactly why they differ.
(good test coverage is not enough for correctness.)
Okay. Sounds good.
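The bytewise comparison asked for above could be sketched roughly as follows. This is a hypothetical helper, not part of khmer; the directory layout (one directory of output files per script version) and the function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def compare_outputs(old_dir, new_dir):
    """Compare same-named output files bytewise via their digests.

    Assumes both directories contain the same set of filenames
    (e.g. the .dist and group files from two extract-partitions runs).
    Returns the list of filenames whose contents differ.
    """
    differing = []
    for old_path in sorted(Path(old_dir).iterdir()):
        new_path = Path(new_dir) / old_path.name
        if file_digest(old_path) != file_digest(new_path):
            differing.append(old_path.name)
    return differing
```

After running the old and new script versions into separate directories, an empty return value would mean the outputs are bytewise identical; any names returned would be the files whose differences need to be understood before merging.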
Ran a modified variant of the "partitioning large datasets" example ( https://khmer.readthedocs.org/en/v1.4.1/user/partitioning-big-data.html#running-on-an-example-data-set ) over the 25k Casava reads in /data, using master's extract-partitions and the refactored version. Diffing the dist files and the group files shows they're the same.
(modified script here for reference: https://gist.github.com/bocajnotnef/0bd649939c60a7b7ed72)
cool - can you add some tests (perhaps md5sum?) that flag differences without necessarily storing all the reads?
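One way to read the md5sum suggestion: record the expected digest of each output file in the test itself, so the repository never needs to store the reads. A hypothetical sketch (the filename and the recorded digest here are placeholders, not real khmer test data, and `check_outputs` is not an existing khmer function):

```python
import hashlib

# Placeholder table: in a real test, each digest would be computed once
# from a known-good run of extract-partitions and pasted in here.
EXPECTED_DIGESTS = {
    "random-20-a.fa.dist": "<known-good md5 goes here>",
}


def md5_of(path):
    """MD5 hex digest of a file's full contents."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def check_outputs(output_dir, expected=EXPECTED_DIGESTS):
    """Map each drifted output file to its current (unexpected) digest.

    An empty dict means every recorded file still matches its
    known-good digest; a non-empty dict flags a behavior change.
    """
    return {
        name: md5_of(f"{output_dir}/{name}")
        for name in expected
        if md5_of(f"{output_dir}/{name}") != expected[name]
    }
```

One caveat of digest-based checks is that they also trip on harmless formatting changes, which fits the earlier request: either the outputs are identical, or the difference has to be explained.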
Wouldn't that require rerunning the full work flow on the data regardless, in order to test the differences?
This could be a less often run test that Jenkins does post merge. You could also drop a comment in the relevant scripts inviting developers making changes to run the tests manually.
Michael R. Crusoe: Programmer & Bioinformatician crusoe@ucdavis.edu The lab for Data Intensive Biology; University of California, Davis https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe
yes, but it's pretty fast on 25k, no?
My impression is that this issue is still not fully resolved @ctb?
This is a really ugly file... apologies in advance.