Open reece opened 3 years ago
SeqRepo is capable of >1500 queries/second single-threaded with local data. At this rate, sequence fetching is likely to be a small component of the overall execution time of a typical analysis pipeline.
Optimizing significantly beyond current performance requires loading sequences in memory. However, it's not generally feasible or useful to prefetch all sequences. Current human databases are ~12GB compressed. Prefetching selected sequences on first access could be very beneficial for certain access patterns.
Prefetching might work as follows. The client would be instantiated with a prefetch cache size, which would set the maximum number of sequences held in the prefetch cache. The default would be 0 (no prefetch).
When a client requests a slice of a sequence, the entire sequence would be read speculatively, anticipating that the next queries might be on the same sequence (e.g., on a single chromosome). Subsequent sequence lookups would be entirely in-memory.
The cache would operate as a typical LRU cache, automatically evicting the least recently accessed sequence once the cache reaches its configured size.
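The behavior described above could be sketched roughly as follows. This is only an illustration of the idea, not SeqRepo's API: `PrefetchCache`, `toy_backend`, `max_sequences`, and `backend_reads` are hypothetical names, and the backend is assumed to be any callable that returns a slice of a sequence (or the whole sequence when `start` and `end` are `None`).

```python
from collections import OrderedDict
from typing import Callable, Optional

# Hypothetical backend signature: (seq_id, start, end) -> sequence string.
Backend = Callable[[str, Optional[int], Optional[int]], str]


class PrefetchCache:
    """LRU cache of whole sequences (illustrative sketch, not SeqRepo code)."""

    def __init__(self, backend: Backend, max_sequences: int = 0):
        self._backend = backend
        self._max = max_sequences          # 0 disables prefetching
        self._cache: "OrderedDict[str, str]" = OrderedDict()
        self.backend_reads = 0             # counts trips to the backend

    def fetch(self, seq_id: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str:
        if self._max <= 0:
            # No prefetch: read only the requested slice from the backend.
            self.backend_reads += 1
            return self._backend(seq_id, start, end)
        if seq_id in self._cache:
            # Cache hit: mark this sequence as most recently used.
            self._cache.move_to_end(seq_id)
        else:
            # Cache miss: speculatively read the entire sequence.
            self.backend_reads += 1
            self._cache[seq_id] = self._backend(seq_id, None, None)
            if len(self._cache) > self._max:
                # Evict the least recently used sequence.
                self._cache.popitem(last=False)
        return self._cache[seq_id][start:end]


# Toy in-memory backend standing in for real sequence storage.
_DATA = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}


def toy_backend(seq_id, start, end):
    return _DATA[seq_id][start:end]


cache = PrefetchCache(toy_backend, max_sequences=1)
first = cache.fetch("chr1", 0, 4)    # miss: loads all of chr1
second = cache.fetch("chr1", 4, 8)   # hit: served from memory
third = cache.fetch("chr2", 0, 4)    # miss: loads chr2, evicts chr1
```

With `max_sequences=1`, the two slices of `chr1` cost a single backend read; the request for `chr2` triggers a second read and evicts `chr1`.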
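Importantly, prefetching can degrade performance if accesses are not suitably ordered. If consecutive requests jump between more sequences than the cache can hold, each request would read an entire sequence only to have it evicted before it is used again.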
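To make the ordering concern concrete, here is a small simulation that counts how many whole-sequence reads an LRU cache would perform for two access orders over the same set of requests. The function name `count_full_reads` and the access lists are made up for illustration; no real sequence data or SeqRepo calls are involved.

```python
from collections import OrderedDict


def count_full_reads(accesses, cache_size):
    """Count whole-sequence loads for an LRU cache holding `cache_size` sequences."""
    cache = OrderedDict()
    reads = 0
    for seq_id in accesses:
        if seq_id in cache:
            cache.move_to_end(seq_id)      # hit: refresh recency
            continue
        reads += 1                         # miss: whole sequence is read
        cache[seq_id] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)      # evict least recently used
    return reads


# Same nine requests, grouped by sequence versus interleaved.
ordered = ["chr1"] * 3 + ["chr2"] * 3 + ["chr3"] * 3
interleaved = ["chr1", "chr2", "chr3"] * 3

ordered_reads = count_full_reads(ordered, cache_size=2)
interleaved_reads = count_full_reads(interleaved, cache_size=2)
print(ordered_reads, interleaved_reads)  # prints: 3 9
```

With a cache of two sequences, the grouped order needs three full reads, while the interleaved order misses on every request and reads a full sequence nine times, which would be slower than fetching only the requested slices.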