Arcadia-Science / sourmashconsumr

Working with the outputs of sourmash in R
https://arcadia-science.github.io/sourmashconsumr/

Create function to read sourmash signatures into functions #16

Closed taylorreiter closed 1 year ago

taylorreiter commented 1 year ago

Sourmash signatures are JSON files that contain FracMinHash sketches and the metadata associated with those sketches. This PR provides a script that can read signatures into a tibble. The test data are signatures of a variety of types, made with different versions of sourmash, so I'm hoping this implementation is fairly robust.

taylorreiter commented 1 year ago

Oh! great question. I think my brain stopped working when I wrote the PR description 😂

Some visualizations (like upset plots here), rarefaction (here), and random forest models (here) rely on the hashes themselves. To get them into R, I've historically written Python code to convert signatures to CSV files. This is a somewhat lossy process (the md5sum and other metadata are dropped), and I've found I write a specific output format for each use case (sometimes I include abundances, sometimes the column name is the sample, etc.). It's all really messy and I hate it. I've been shying away from writing a JSON parser in R because I thought it would be complicated (like, I've literally avoided writing this function for 3 years). And then I sat down to do it and it's so easy.

So this function is a building block for other visualizations/analyses and I think it will make signature parsing cleaner across projects.
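
For anyone following along, here is a rough sketch of the flattening idea described above: one row per hash, with the md5sum and other metadata kept alongside it instead of dropped. It is written in Python with only the standard library so it can be run anywhere, and it is not the code in this PR. The example signature is hand-written and hypothetical, and the field names (`signatures`, `mins`, `abundances`, `md5sum`, and so on) reflect my understanding of the sourmash JSON layout, so treat them as assumptions. `signature_to_rows` is a made-up helper name.

```python
import json

# Hypothetical, hand-written signature for illustration only. The field names
# are an assumption about the sourmash JSON layout, and the md5sum is fake.
EXAMPLE_SIGNATURE_JSON = """
[
  {
    "class": "sourmash_signature",
    "name": "sample_a",
    "filename": "sample_a.fastq.gz",
    "signatures": [
      {
        "ksize": 31,
        "molecule": "DNA",
        "md5sum": "0123456789abcdef0123456789abcdef",
        "mins": [11, 22, 33],
        "abundances": [1, 4, 2]
      }
    ]
  }
]
"""


def signature_to_rows(text):
    """Flatten signature JSON text into one dict per hash (long format)."""
    rows = []
    for record in json.loads(text):
        for sketch in record.get("signatures", []):
            mins = sketch.get("mins", [])
            # Abundances are optional; fill with None when they are absent.
            abundances = sketch.get("abundances") or [None] * len(mins)
            for hash_value, abundance in zip(mins, abundances):
                rows.append(
                    {
                        # Record-level metadata, repeated on every row.
                        "name": record.get("name"),
                        "filename": record.get("filename"),
                        # Sketch-level metadata that a CSV export might drop.
                        "md5sum": sketch.get("md5sum"),
                        "ksize": sketch.get("ksize"),
                        "molecule": sketch.get("molecule"),
                        "hash": hash_value,
                        "abundance": abundance,
                    }
                )
    return rows


rows = signature_to_rows(EXAMPLE_SIGNATURE_JSON)
for row in rows:
    print(row["name"], row["ksize"], row["hash"], row["abundance"])
```

A long format like this repeats the metadata on every row, which is wasteful but means the same table can feed an upset plot, a rarefaction curve, or a model matrix without a custom export for each.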