OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
https://phylogatr.org/
MIT License
0 stars 0 forks source link

Expand bin/rake metrics:species to include detailed report statistics #17

Open johrstrom opened 2 years ago

johrstrom commented 2 years ago

Previously we had a script that produced detailed metrics utilizing descriptive_statistics gem to provide stats that Bryan asked for here:

Here are the stats that we'd like.

Breakdown of the number of species represented in each Kingdom, Phylum, and Class. For each of the above, Mean, Medium, Mode on the number of sequences per entry Find species with the largest number of sequences in each of these categories and report that number.

Here, "entry" is a single .fa file. So imagine for Phylum “Foo” there were only 2 species, each with two genes (thus two fa files). Species A having 10,12 sequences, species B having 14,16 sequences. The results would be:

mean: 13 mode: 10 median: 13 species with largest number of sequences in Foo: B

This is a script that generated that (though below currently needs access to the un-tarred gene file data). require 'csv'

records = []

puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_class).each do |taxon_class|
    species = Occurrence.where(taxon_class: taxon_class).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_class",
      taxon_class,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}

puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_phylum).each do |taxon_phylum|
    species = Occurrence.where(taxon_phylum: taxon_phylum).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_phylum",
      taxon_phylum,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}
puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_kingdom).each do |taxon_kingdom|
    species = Occurrence.where(taxon_kingdom: taxon_kingdom).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_kingdom",
      taxon_kingdom,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}

CSV.open("report.csv", "wb") do |csv|
  records.each do |record|
    csv << record
  end
end