dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
755 stars 296 forks source link

Add good tests for scripts/extract-partitions.py & refactor #144

Open ctb opened 11 years ago

ctb commented 11 years ago

This is a really ugly file... apologies in advance.

RamRS commented 11 years ago

Quick question: Does the test_extract_partitions() not correspond to the current functionality of extract_partitions.py?

RamRS commented 11 years ago

Working on this here: https://github.com/RamRS/khmer/tree/extrpart-tweak

First step, gonna walk through code and add comments - this helps me understand the code. Second, gonna add argparse usage Third, check algorithm consistency with current test Fourth, perform necessary updates

Does this look okay?

ctb commented 11 years ago

On Sun, Sep 01, 2013 at 07:16:23PM -0700, Ram RS wrote:

Working on this here: https://github.com/RamRS/khmer/tree/extrpart-tweak

First step, gonna walk through code and add comments - this helps me understand the code. Second, gonna add argparse usage Third, check algorithm consistency with current test Fourth, perform necessary updates

Does this look okay?

sounds good.

--t

C. Titus Brown, ctb@msu.edu

RamRS commented 11 years ago

It would really help (save me a lot of time) if I could see the output of the 'head SRR492065.pe.kak.qc.fq.gz.part' command from here: https://khmer-protocols.readthedocs.org/en/v0.8/metagenomics/3-partition.html

Do you have a sample output? Else I'd have to work through till Step-3...

ctb commented 11 years ago

On Sun, Sep 01, 2013 at 07:22:15PM -0700, Ram RS wrote:

It would really help (save me a lot of time) if I could see the output of the 'head SRR492065.pe.kak.qc.fq.gz.part' command from here: https://khmer-protocols.readthedocs.org/en/v0.8/metagenomics/3-partition.html

Do you have a sample output? Else I'd have to work through till Step-3...

Mmh, not handy. Why not use the file that gets produced in the test?

random-20-a.fa.part

Just run the commands yourself, instead of using the test.

cheers,

--t

C. Titus Brown, ctb@msu.edu

mr-c commented 10 years ago

@ctb is this done?

ctb commented 10 years ago

Nope.

RamRS commented 10 years ago

This is a pretty old issue. Refactoring was done by @mr-c long back, and IIRC test coverage is 99% too. My (closed) PR on this: https://github.com/ged-lab/khmer/pull/325

Ram

On Mon, Jun 30, 2014 at 9:30 AM, C. Titus Brown notifications@github.com wrote:

Nope.

— Reply to this email directly or view it on GitHub https://github.com/ged-lab/khmer/issues/144#issuecomment-47531307.

mr-c commented 10 years ago

http://ci.ged.msu.edu/job/khmer-master/label=linux/69/cobertura/_default_/scripts_extract_partitions_py/

ctb commented 10 years ago

Yes, high test coverage is great -- thanks! However, the main body of code is still spaghetti code and needs to be refactored. This can be done much more easily with high test coverage, but still needs to be done. We can close this issue and open a new one focused just on refactoring if you think it's a good idea?

RamRS commented 10 years ago

Modularization would be amazing, but from the perspective of the entire project, what would be the increase in reusability - is any code/logic in this file used anywhere else in the project?

I agree, we could close this issue and open one focused on refactoring. IMHO, refactoring will need more discussion than coding.

Ram

On Mon, Jun 30, 2014 at 10:18 AM, C. Titus Brown notifications@github.com wrote:

Yes, high test coverage is great -- thanks! However, the main body of code is still spaghetti code and needs to be refactored. This can be done much more easily with high test coverage, but still needs to be done. We can close this issue and open a new one focused just on refactoring if you think it's a good idea?

— Reply to this email directly or view it on GitHub https://github.com/ged-lab/khmer/issues/144#issuecomment-47536731.

ctb commented 10 years ago

On Mon, Jun 30, 2014 at 07:22:48AM -0700, Ram RS wrote:

Modularization would be amazing, but from the perspective of the entire project, what would be the increase in reusability - is any code/logic in this file used anywhere else in the project?

The code right now is very difficult to understand, which means that there may well be bugs lurking in there. Worse, it means that if we need to change it to add future functionality, we are likely to introduce new bugs. Spaghetti code is an example of a bad code smell:

http://en.wikipedia.org/wiki/Code_smell

and we seek to eradicate that wherever it may lie. Not a terribly urgent issue, though.

I agree, we could close this issue and open one focused on refactoring. IMHO, refactoring will need more discussion than coding.

Than testing, you mean? Refactoring is part of coding :)

RamRS commented 10 years ago

What I mean to say is, in my limited vision of the code, modularity would just make the 100 LoC right now into a set of 5 methods with ~20 LoC each, not necessarily making each method reusable. It would look good and be more flexible to future changes, but no immediate gain otherwise. Pretty sure I'm being a bit too nitpicky.

And I was referring to this email discussion of how to refactor (extract methods, introduce modularity and reusability) where "coding" is the actual implementation, which in this case should not take more than an hour.

Either ways, I can take this up.

Ram

On Mon, Jun 30, 2014 at 10:25 AM, C. Titus Brown notifications@github.com wrote:

On Mon, Jun 30, 2014 at 07:22:48AM -0700, Ram RS wrote:

Modularization would be amazing, but from the perspective of the entire project, what would be the increase in reusability - is any code/logic in this file used anywhere else in the project?

The code right now is very difficult to understand, which means that there may well be bugs lurking in there. Worse, it means that if we need to change it to add future functionality, we are likely to introduce new bugs. Spaghetti code is an example of a bad code smell:

http://en.wikipedia.org/wiki/Code_smell

and we seek to eradicate that wherever it may lie. Not a terribly urgent issue, though.

I agree, we could close this issue and open one focused on refactoring. IMHO, refactoring will need more discussion than coding.

Than testing, you mean? Refactoring is part of coding :)

— Reply to this email directly or view it on GitHub https://github.com/ged-lab/khmer/issues/144#issuecomment-47537647.

ctb commented 10 years ago

On Mon, Jun 30, 2014 at 07:48:24AM -0700, Ram RS wrote:

What I mean to say is, in my limited vision of the code, modularity would just make the 100 LoC right now into a set of 5 methods with ~20 LoC each, not necessarily making each method reusable. It would look good and be more flexible to future changes, but no immediate gain otherwise. Pretty sure I'm being a bit too nitpicky.

Improving the code base IS gain. But as you say it's not urgent.

ctb commented 10 years ago

Hey @RamRS, If you want to claim this, please create a new PR and reference this issue. Thanks!

RamRS commented 10 years ago

Alrighty. I'll open a WIP PR.

P.S: I was handling this earlier, so I thought I'd continue. I hope you are OK with that. If you are not, please do let me know.

Ram

On Mon, Jun 30, 2014 at 10:59 AM, C. Titus Brown notifications@github.com wrote:

Hey @RamRS https://github.com/RamRS, If you want to claim this, please create a new PR and reference this issue. Thanks!

— Reply to this email directly or view it on GitHub https://github.com/ged-lab/khmer/issues/144#issuecomment-47542165.

ctb commented 10 years ago

On Mon, Jun 30, 2014 at 08:19:42AM -0700, Ram RS wrote:

Alrighty. I'll open a WIP PR.

P.S: I was handling this earlier, so I thought I'd continue. I hope you are OK with that. If you are not, please do let me know.

Fine by me! But you can also ignore it and do something new :)

ctb commented 9 years ago

No longer claimed.

bocajnotnef commented 9 years ago

I can take this on if extract-partitions still needs a refactor.

ctb commented 9 years ago

On Mon, Aug 10, 2015 at 11:03:57AM -0700, Jake Fenton wrote:

I can take this on if extract-partitions still needs a refactor.

please!

bocajnotnef commented 9 years ago

Looking at the CI coverate ( http://ci.oxli.org/job/khmer-master/418/label=linux/cobertura/_default_/scripts_extract_partitions_py/ ) I'm seeing near-compelte coverage on extract-partitions, so I'm just going to go ahead with the refactor.

ctb commented 9 years ago

On Thu, Aug 13, 2015 at 10:10:13AM -0700, Jake Fenton wrote:

Looking at the CI coverate ( http://ci.oxli.org/job/khmer-master/418/label=linux/cobertura/_default_/scripts_extract_partitions_py/ ) I'm seeing near-compelte coverage on extract-partitions, so I'm just going to go ahead with the refactor.

For testing on this kind of thing, I would like to ask that (prior to merge) we:

(good test coverage is not enough for correctness.)

bocajnotnef commented 9 years ago

Okay. Sounds good.

On Fri, Aug 14, 2015, 08:33 C. Titus Brown notifications@github.com wrote:

On Thu, Aug 13, 2015 at 10:10:13AM -0700, Jake Fenton wrote:

Looking at the CI coverate ( http://ci.oxli.org/job/khmer-master/418/label=linux/cobertura/_default_/scripts_extract_partitions_py/ ) I'm seeing near-compelte coverage on extract-partitions, so I'm just going to go ahead with the refactor.

For testing on this kind of thing, I would like to ask that (prior to merge) we:

  • identify a few different data sets to run the through do-partition etc. (the ones in data/ are probably ok)
  • run extract-partitions, old version
  • run extract-partitions, new version
  • make sure results are bytewise identical, OR that we understand exactly why they differ.

(good test coverage is not enough for correctness.)

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/144#issuecomment-131151481.

bocajnotnef commented 9 years ago

Ran a modified variant of the "partitioning large datasets" ( https://khmer.readthedocs.org/en/v1.4.1/user/partitioning-big-data.html#running-on-an-example-data-set ) over the 25k Casava reads in /data, using master's extract-partition and the refactored version. Diffing the dist files and the group files show's they're the same.

bocajnotnef commented 9 years ago

(modified script here for reference: https://gist.github.com/bocajnotnef/0bd649939c60a7b7ed72)

ctb commented 9 years ago

On Fri, Aug 14, 2015 at 10:34:10AM -0700, Jake Fenton wrote:

Ran a modified variant of the "partitioning large datasets" ( https://khmer.readthedocs.org/en/v1.4.1/user/partitioning-big-data.html#running-on-an-example-data-set ) over the 25k Casava reads in /data, using master's extract-partition and the refactored version. Diffing the dist files and the group files show's they're the same.

cool - can you add some tests (perhaps md5sum?) that flag differences without necessarily storing all the reads?

bocajnotnef commented 9 years ago

Wouldn't that require rerunning the full work flow on the data regardless, in order to test the differences?

On Fri, Aug 14, 2015, 11:32 C. Titus Brown notifications@github.com wrote:

On Fri, Aug 14, 2015 at 10:34:10AM -0700, Jake Fenton wrote:

Ran a modified variant of the "partitioning large datasets" ( https://khmer.readthedocs.org/en/v1.4.1/user/partitioning-big-data.html#running-on-an-example-data-set ) over the 25k Casava reads in /data, using master's extract-partition and the refactored version. Diffing the dist files and the group files show's they're the same.

cool - can you add some tests (perhaps md5sum?) that flag differences without necessarily storing all the reads?

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/144#issuecomment-131203955.

mr-c commented 9 years ago

This could be a less often run test that Jenkins does post merge. You could also drop a comment in the relevant scripts inviting developers making changes to run the tests manually.

On Fri, Aug 14, 2015, 12:09 Jake Fenton notifications@github.com wrote:

Wouldn't that require rerunning the full work flow on the data regardless, in order to test the differences?

On Fri, Aug 14, 2015, 11:32 C. Titus Brown notifications@github.com wrote:

On Fri, Aug 14, 2015 at 10:34:10AM -0700, Jake Fenton wrote:

Ran a modified variant of the "partitioning large datasets" (

https://khmer.readthedocs.org/en/v1.4.1/user/partitioning-big-data.html#running-on-an-example-data-set ) over the 25k Casava reads in /data, using master's extract-partition and the refactored version. Diffing the dist files and the group files show's they're the same.

cool - can you add some tests (perhaps md5sum?) that flag differences without necessarily storing all the reads?

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/144#issuecomment-131203955.

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/144#issuecomment-131211693.

Michael R. Crusoe: Programmer & Bioinformatician crusoe@ucdavis.edu The lab for Data Intensive Biology; University of California, Davis https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe

ctb commented 9 years ago

On Fri, Aug 14, 2015 at 12:09:46PM -0700, Jake Fenton wrote:

Wouldn't that require rerunning the full work flow on the data regardless, in order to test the differences?

yes, but it's pretty fast on 25k, no?

standage commented 7 years ago

My impression is that this issue is still not fully resolved @ctb?