[JOSS Review] Paper comment

natir commented 3 years ago

This relates to my JOSS review openjournals/joss-reviews#2991

First thanks for your work on this tools and publications

Some comments about paper:

[x] The interest of have rust-bio and needltail parser in same time isn't clear in publication
[x] Could you add memory usage in benchmark ?
[x] I think it's easier to read result if stats and filter are present in separate Table
[x] Which parser is use by nanoq in benchmark ?
[x] Could you compare rust-bio-stats sequence-stats to nanoq stats functionality is very similar
[x] Compare error rate, read quality, read length, across each filter method, could help to see which method is the best or if we can't see change between methods

esteinig commented 3 years ago

Quick response before working on changes on joss branch:

Good point, I ended up using the Needletail parser as it currently supports fasta/q out of the box. I saw you are working on this for rust-bio for an upcoming release? Do you think it is worth including both?
Yes, absolutely.
Easily done, thanks.
It's the default Needletail parser, I will make that more clear
Ohhh very cool, I was not aware of this! Yep can be done, do you want to see it in the benchmarks as well?
Yep, absolutely - you may have seen the other thread on Filtlong differences as well

Thanks very much!

natir commented 3 years ago

Quick response to some quick response:

1 -> We agree that all you need is a fast·a·q parser ? Needletail support fast·a·q out of box, with gzip, bzip, lzma compression or not, support multi-thread and it's faster than rust-bio. Why did you want keep rust-bio parser ? 5 -> Yes sure !

esteinig commented 3 years ago

That was quick!

Mainly because I really like rust-bio, but I see your point :) I should contribute the filter function to rust-bio-tools instead

natir commented 3 years ago

I like rust-bio too but for your needs isn't the crate you need

esteinig commented 3 years ago

Indeed, needletail is pretty amazing

esteinig commented 2 years ago

@natir my apologies for the long pause, and thanks for your patience. Some uhm interesting times.

I have rewritten the code base and will go through your review after adding tests over the weekend. Thanks again for the suggestions, they helped improve nanoq a lot.

esteinig commented 2 years ago

In regards to your points @natir

[x] removed rust-bio parser in favor of needletail parser
[x] added memory usage to benchmarks; added more detailed information for benchmark replication
[x] separated tables into task and test file benchmarks
[x] only using needletail parser now
[x] added rbt sequence-stats in benchmarks and explanation
[x] compared output between tools to ensure valid computations (+ added better tests)

Nanoq is really fast now (--fast) especially when ignoring quality scores :tada:

Output is compliant between all tools, except that nanoq returns an integer for average read length (which is the correct way to return an average of read length in my opinion, as there are no fractional nucleotide bases in reality)

esteinig commented 2 years ago

Did you want me to check anything else regarding your last point @natir?

natir commented 2 years ago

First thank you for many works you do, nanoq and paper seems many ways better than before.

Yes and I have some other comments, on paper.

In paper you show, nanoq is faster than any other to get stats and memory usage is reasonable and it's nice. But you didn't compare effect of your filter compare to other, just a plot or table to show read length and read quality after filter. A very interesting bonus, could be if we can see filtering improve down stream analysis, like assembly, species detection, variant calling (make your choice).

Another point, when I read section Applications I feel read a user manual, I think a scientific paper isn't the good place for a user manual. nanoq is able work on active nanopore run is very nice, but you can move this information in another section.

You produce a very expensive comparison of nanoq to other tools, and it's nice but a small text to resume this figure and table could be nice. Just few sentence to say nanoq is faster than any other tools and memory usage is low and constant. I also think it's not need to have all condition in main text, you can just let major result in main text and put other in appendix.

Another recommend change could be replace, long and not so easy to read table, by a scatter plot with computation and memory usage as axes. You lost speedup multiplication but you can add it in caption or in main text and you can put some conditions in the same figure.

I think with all this change paper could be very very nice, but if you @bovee or @luizirber think it's too much change I didn't want block publication.

Just to resume and follow decision on my point (If you wont fix a point just check it)

[x] Compare filter result
[x] Remove or move Applications section
[x] Add comment on result
[x] Move some result in appendix
[x] Replace table by scatter plot

natir commented 2 years ago

I also recently discover seqtk fqchk have very similar functionally to nanoq stat. seqtk is a very famous tools compare nanoq to this one could be nice but just cite it is ok too.

esteinig commented 2 years ago

Appreciate the additional comments on this @natir! I would argue that showing filter results is a bit redundant (it's not very exciting to show the same outputs) and downstream improvement of other applications is not really necessary (and a bit too much for this paper).

Overall these are very basic operations done routinely on nanopore data (from my experience). Nanoq is mostly trying to do this faster than other tools. I think a comparison with seqtk fqchk would be great in that context.

Agree with the other points, not sure if scatter plot is the right way to show the results to be honest (the data types don't strike me as appropriate for scatter plots, as they generally show relationships between variables). Perhaps an annotated barplot would be better, and only the full dataset benchmark?

I will make tables and figures more legible, condense the benchmarks, and remove the subset benchmarks to the README and generally brush this more into paper format.

natir commented 2 years ago

For the filter effect it's ok for me.

About usage of scatter plot to show cpu and memory usage, I agree it's not a perfect fits, with the purpose of plot. But it's really interesting, because you can have different condition on same plot (with different marker), and it's easy to spot the best tools because it's in the left bottom corner. But it's your call.

esteinig commented 2 years ago

Thanks for the patience @natir - I made the following changes to the paper:

added seqtk fq
commented on benchmark results
rewrote paper summary
removed application section
revised figures but kept barplots

natir commented 2 years ago

Ok I didn't write all of these in this way but it's your paper, information I think are important is present in paper.

Just a point, author of niffler are Luiz Iber and I.

esteinig commented 2 years ago

Oh sorry for this! Let me add you in the Acknowledgements as well

esteinig commented 2 years ago

Thanks excellent, fixed now. Will close comment but feel free to open or comment if there was anything else

esteinig / nanoq

[JOSS Review] Paper comment #17