TimothyStiles opened 9 months ago
I think this might combine well with #362. If we have a tutorial series that walks a real-world example through our package end to end, we can also use it to benchmark performance.
I think it'd be great to benchmark against all of GenBank or UniProt or PDB. It would take a server with decent hard drives, or just a lot of data per month to stream, and it would actually validate that our parsers work well.
This is a great idea! We could have our new CI/CD pipeline (#365) incorporate this.
I don't think it'd be advisable to have it run against ALL of these massive datasets every time we merge, but we could have it pick a consistent, representative subset.
It'd be nice to also have all new entries in these DBs run against the latest version of our parsers.
Also, these DBs aren't that big size-wise since it's just text and not image data, right? I have no clue, this is a genuine question.
GenBank, I think, is a little over a terabyte, so not that bad. UniProt is like 250 GB. SRA, on the other hand, is 33 petabytes (and the Wayback Machine is 57 petabytes), which kinda puts it into perspective. SRA there is NO WAY we could handle, but GenBank + UniProt would probably be doable.
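The "consistent, representative subset" idea above could be as simple as hashing each record's accession and keeping records whose hash falls in a fixed bucket — the same slice of the data gets benchmarked on every CI run without storing a separate manifest. A minimal sketch (the `inSubset` helper and the sample accessions are hypothetical, not part of poly's API):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inSubset reports whether a record belongs to the benchmark subset.
// Hashing the accession keeps the choice deterministic across runs,
// so CI always exercises the same ~1/denominator slice of the data.
func inSubset(accession string, denominator uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(accession))
	return h.Sum32()%denominator == 0
}

func main() {
	// hypothetical sample accessions, for illustration only
	accessions := []string{"NC_000913.3", "NC_001422.1", "NM_000546.6", "U00096.3"}
	for _, acc := range accessions {
		if inSubset(acc, 2) { // keep roughly half of the records
			fmt.Println("benchmark:", acc)
		}
	}
}
```

Because the selection is a pure function of the accession, the subset also stays stable as the upstream databases grow — new entries just hash into (or out of) the bucket.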
This issue has had no activity in the past 2 months. Marking as stale.
It'd be really cool to have a benchmarking suite that we can run to see if we've unintentionally introduced any performance changes before merging into main.
Idea would be that on PR creation we'd run the benchmarks on both the main branch and the PR branch, and use it to highlight any significant changes (positive and negative).
We can start slow with what we'd consider "problem areas" and expand out from there.
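The main-vs-PR comparison above could be sketched with Go's `testing.Benchmark`, which runs a benchmark function programmatically and returns timings you can diff. A minimal sketch — the two `parse*` functions are toy stand-ins for the "main" and "PR" versions of a parser hot path, and the 10% threshold is an arbitrary choice, not anything the project has settled on:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"unicode"
)

// parseBase stands in for the main-branch implementation.
func parseBase(s string) int { return len(strings.Fields(s)) }

// parsePR stands in for the PR-branch implementation of the same routine.
func parsePR(s string) int {
	n := 0
	inField := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			inField = false
		} else if !inField {
			inField = true
			n++
		}
	}
	return n
}

func main() {
	input := strings.Repeat("LOCUS NC_000913 4641652 bp DNA circular\n", 100)

	base := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			parseBase(input)
		}
	})
	pr := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			parsePR(input)
		}
	})

	delta := float64(pr.NsPerOp()-base.NsPerOp()) / float64(base.NsPerOp()) * 100
	fmt.Printf("base: %d ns/op  pr: %d ns/op  delta: %+.1f%%\n",
		base.NsPerOp(), pr.NsPerOp(), delta)
	if delta > 10 { // arbitrary significance threshold, for illustration
		fmt.Println("possible regression, flag on the PR")
	}
}
```

In practice the CI job would more likely run `go test -bench` on each branch and feed the two outputs to `benchstat` (golang.org/x/perf/cmd/benchstat), which also does the statistical significance testing rather than a raw percentage cutoff.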