Closed TeXitoi closed 7 years ago
Nice work! I'm not terribly familiar with this particular benchmark or the code, but it looks mostly reasonably to me. I do think it would be nice to use a better name than CustomHash
though. Maybe NaiveHash
? :P
@BurntSushi Thanks for your rapid feedback. CustomHash renamed to NaiveHash.
Note: there is little output. I'd still like to test if locking stdout achieves any improvement.
@llogiq strange, it's a bit worst using:
diff --git a/src/k_nucleotide.rs b/src/k_nucleotide.rs
index 37efa6d..47c6b2d 100644
--- a/src/k_nucleotide.rs
+++ b/src/k_nucleotide.rs
@@ -118,19 +118,19 @@ fn generate_frequencies(input: &[u8], frame: usize) -> Map {
frequencies
}
-fn print_frequencies(frequencies: &Map, frame: usize) {
+fn print_frequencies<W: std::io::Write>(mut w: W, frequencies: &Map, frame: usize) {
let mut vector: Vec<_> = frequencies.iter().map(|(&code, &count)| (count, code)).collect();
vector.sort();
let total_count = vector.iter().map(|&(count, _)| count).sum::<u32>() as f32;
for &(count, key) in vector.iter().rev() {
- println!("{} {:.3}", key.to_string(frame), (count as f32 * 100.0) / total_count);
+ writeln!(w, "{} {:.3}", key.to_string(frame), (count as f32 * 100.0) / total_count).unwrap();
}
- println!("");
+ writeln!(w, "").unwrap();
}
-fn print_occurrences(frequencies: &Map, occurrence: &'static str) {
- println!("{}\t{}", frequencies[&Code::from_str(occurrence)], occurrence);
+fn print_occurrences<W: std::io::Write>(mut w: W, frequencies: &Map, occurrence: &'static str) {
+ writeln!(w, "{}\t{}", frequencies[&Code::from_str(occurrence)], occurrence).unwrap();
}
fn get_sequence<R: std::io::BufRead>(r: R, key: &str) -> Vec<u8> {
@@ -145,6 +145,8 @@ fn main() {
let stdin = std::io::stdin();
let input = get_sequence(stdin.lock(), ">THREE");
let input = Arc::new(input);
+ let stdout = std::io::stdout();
+ let mut stdout_lock = stdout.lock();
let occ_freqs: Vec<_> = OCCURRENCES.iter().skip(2).map(|&occ| {
let input = input.clone();
@@ -152,12 +154,12 @@ fn main() {
}).collect();
for i in 1..3 {
- print_frequencies(&generate_frequencies(&input, i), i);
+ print_frequencies(&mut stdout_lock, &generate_frequencies(&input, i), i);
}
for &occ in OCCURRENCES.iter().take(2) {
- print_occurrences(&generate_frequencies(&input, occ.len()), occ);
+ print_occurrences(&mut stdout_lock, &generate_frequencies(&input, occ.len()), occ);
}
for (&occ, freq) in OCCURRENCES.iter().skip(2).zip(occ_freqs.into_iter()) {
- print_occurrences(&freq.join().unwrap(), occ);
+ print_occurrences(&mut stdout_lock, &freq.join().unwrap(), occ);
}
}
Also a bit worst with
diff --git a/src/k_nucleotide.rs b/src/k_nucleotide.rs
index 37efa6d..bd9b794 100644
--- a/src/k_nucleotide.rs
+++ b/src/k_nucleotide.rs
@@ -8,6 +8,7 @@ use std::sync::Arc;
use std::thread;
use std::hash::{Hasher, BuildHasherDefault};
use std::collections::HashMap;
+use std::io::Write;
struct NaiveHasher(u64);
impl Default for NaiveHasher {
@@ -118,19 +119,19 @@ fn generate_frequencies(input: &[u8], frame: usize) -> Map {
frequencies
}
-fn print_frequencies(frequencies: &Map, frame: usize) {
+fn print_frequencies(w: &mut std::io::StdoutLock, frequencies: &Map, frame: usize) {
let mut vector: Vec<_> = frequencies.iter().map(|(&code, &count)| (count, code)).collect();
vector.sort();
let total_count = vector.iter().map(|&(count, _)| count).sum::<u32>() as f32;
for &(count, key) in vector.iter().rev() {
- println!("{} {:.3}", key.to_string(frame), (count as f32 * 100.0) / total_count);
+ writeln!(w, "{} {:.3}", key.to_string(frame), (count as f32 * 100.0) / total_count).unwrap();
}
- println!("");
+ writeln!(w, "").unwrap();
}
-fn print_occurrences(frequencies: &Map, occurrence: &'static str) {
- println!("{}\t{}", frequencies[&Code::from_str(occurrence)], occurrence);
+fn print_occurrences(w: &mut std::io::StdoutLock, frequencies: &Map, occurrence: &'static str) {
+ writeln!(w, "{}\t{}", frequencies[&Code::from_str(occurrence)], occurrence).unwrap();
}
fn get_sequence<R: std::io::BufRead>(r: R, key: &str) -> Vec<u8> {
@@ -145,6 +146,8 @@ fn main() {
let stdin = std::io::stdin();
let input = get_sequence(stdin.lock(), ">THREE");
let input = Arc::new(input);
+ let stdout = std::io::stdout();
+ let mut stdout_lock = stdout.lock();
let occ_freqs: Vec<_> = OCCURRENCES.iter().skip(2).map(|&occ| {
let input = input.clone();
@@ -152,12 +155,12 @@ fn main() {
}).collect();
for i in 1..3 {
- print_frequencies(&generate_frequencies(&input, i), i);
+ print_frequencies(&mut stdout_lock, &generate_frequencies(&input, i), i);
}
for &occ in OCCURRENCES.iter().take(2) {
- print_occurrences(&generate_frequencies(&input, occ.len()), occ);
+ print_occurrences(&mut stdout_lock, &generate_frequencies(&input, occ.len()), occ);
}
for (&occ, freq) in OCCURRENCES.iter().skip(2).zip(occ_freqs.into_iter()) {
- print_occurrences(&freq.join().unwrap(), occ);
+ print_occurrences(&mut stdout_lock, &freq.join().unwrap(), occ);
}
}
Interesting. Perhaps use a BufWriter
? It helped with fasta-redux
at least.
@llogiq no change, but that's not so surprising, the output is very small.
In .to_string()
, I'd be interested if indexing a byte string is compiled into the same code (considering that the compiler can easily prove that c
is in bounds.
Also )
I don't get the idea.
to_string
is really not a bottleneck, thus I think simple and readable code is more important.
To everyone watching this project, please review this code. @llogiq @BurntSushi @Veedrac ?
In particular, I'm not found of my hand scheduling "there is 4 CPU, thus launch the 3 more costy jobs in their own thread and do the rest syncronously".
Performances on the output of fasta 25000000:
Thanks for your time.