Add cache to speedup strand inference

Currently, parallelinferstrand() distributes the workload across N cores using multiprocessing. However, this step can be further optimized if we have at disposal a stored cache containing all the information needed to perform the step.

General idea

Suppose we want to run a pipeline on N different GWAS, but always with the same reference .vcf.gz file.
parallelinferstrand() fetches the data from the .vcf.gz, which is a quite time-consuming operation if it need to be done millions of times (sure, the .tbi index helps a lot the fetching performance, but it's still slow compared to an in-memory approach)
Instead of fetching from the .vcf.gz every time the pipeline is run, we can read the whole reference file once and build a cache (i.e. a Python dictionary) containing only the information needed to perform the check_strend and check_indel steps. This information includes: record.pos, record.ref, record.alts, record.info[ref_alt_freq][0].
When parallelinferstrand() reaches the check_strend and/or check_indel steps, it checks if the cache exists and if it's true, the cache is used.

Actual implementation

The cache will be searched in the same directory where the .vcf.gz file is located, or in the user's cache folder (path obtained using pltaformdirs). The cache will also be written to one of these directories.
gwaslab.cache_manager contains all the classes needed to load, build and manage the cache:
- CacheManager: class that loads/builds the cache when instantiated. It exposes an apply_fn method that can be used to apply an arbitrary function having a cache argument, Optionally, this class can be instantiated with a cache_loader or a cache_process instead of the .vcf.gz path (see below for details)
- CacheBuilder: class that builds the cache by reading the whole .vcf.gz file.
- CacheLoaderThread and CacheLoaderProcess: classes that can be used to load the cache in the background in a separate thread or process, respectively. In both cases, the cache object will result in the main process memory.
- CacheProcess: class that can be used to load the cache in the background in a separate process, keeping the cache in the sub-process memory.

Why load the cache in the background? If our pipeline consists of several steps, it's very likely that the harmonization step is not the first one. So, we could start loading the cache while the pipeline is still running. Of course this approach is advantageous if we have >1 core.

`CacheLoaderThread` vs `CacheLoaderProcess` vs `CacheProcess`

All three classes can be used to load the cache in the background, however:

CacheLoaderThread: in theory, this should be the best and simplest approach, because the cache is loaded directly in the main process memory, using multiple cores. However, due to the GIL (Global Interpreter Lock) in Python, which prevents true multithreading, this approach is not efficient and it slows down the main process on which the rest of the pipeline runs. If and when the GIL is removed (months, years, never?), this may become the preferred approach.
CacheLoaderProcess: it solves the previous problem because processes are not subject to the GIL. However, in order to access the cache object directly from the main process on which the pipeline is running, the processes have to copy the object from their memory to the main one, which wastes time and memory if the cache is large.
CacheProcess: this class aims to solve the problems of the previous two classes, and for this reason it should be considered the preferred approach for now. It still uses processes to load the cache, but it does not copy the cache to the main memory. Instead, it exposes some methods to interact directly with the cache stored in the process memory. In this case, the data needed to interact with the cache is copied to the sub-process memory. For example, if we want to perform some operations that compare the cache with a Sumstats dataframe, it's the dataframe that is the object copied to sub-process memory. \ So: Dataframe (main memory) -- copied to --> sub-process memory, instead of: main memory <-- copied to -- cache (sub-process memory), which can save a lot of time and memory when sizeof(dataframe) << sizeof(cache)

Usage

The only API change is that now parallelinferstrand has a new argument cache_options (which can be passed within the inferstrand_args arguement of harmonize()). Such cache_options argument is a dictionary which may contains the following keys:

cache_manager: CacheManager object or None. If any between cache_loader and cache_process is not None, or use_cache is True, a CacheManager object is automatically created.
trust_cache: bool (optional, default: True). Whether to completely trust the cache or not. Trusting the cache means that any key not found inside the cache will be considered as a missing value even in the VCF file. If False, the VCF file will be fetched for all the missing values in the cache.
cache_loader: Object exposing a get_cache() method or None.
cache_process: Object exposing an apply_fn() method or None.
use_cache: bool (optional, default: False). If any of the cache_manager, cache_loader or cache_process is not None, this is automatically set to True . If set to True and all of cache_manager, cache_loader, and cache_process are None, the cache is loaded (or built) on the fly.

Basic usage:

sumstats.harmonize(..., inferstrand_args={'cache_options': {'use_cache': True}})

This will start loading the cache once the harmonization step reaches the parallelinferstrand method.

Advanced usage (preloading):

Somewhere at the begin of the pipeline (e.g. immediatly after the creation of the sumstats object):

cache_process = gwaslab.cache_manager.CacheProcess(vcf_gz_path, ref_alt_freq=ref_alt_freq, n_cores=NUM_WORKERS, log=sumstats.log, verbose=True)
cache_process.start()

Then, when needed:

sumstats.harmonize(..., inferstrand_args={'cache_options': {'cache_process': cache_process}})

In both basic usage and advanced usage, the cache will be automatically built if not found. So, in practice, there is no need to manually instantiate and use a CacheBuilder object. Also, 99% of the time there is no need to manually create a CacheManager.

N.B. This PR does not change the default behavior, so the cache will not be built and used unless explicitly requested by the settings in cache_options.

Improvement and drawbacks

For a real-world GWAS (and not a toy one) with a real-world VCF file, the parallelinferstrand() step could take 10-15 minutes (depending on how many cores are available). With the cache implementation, it could take up to 3 mins to load the cache, but then the check_strand and check_indels steps take few seconds. These 3 minutes required to load the cache can be "removed" if the loading is done in the background. The drawback of this approach is that, depending on the VCF file, the cache could be result quite big when loaded into memory (in my tests, a cache for 80M variants requires ~40GB of RAM). In the future, we could try to limit this drawback by building a separate cache for each chromosome, and then loading one cache at a time into memory. This will probably slow down the process a bit, since the cache has to be loaded on the fly when it is needed (although you could preload the next cache while the first one is being used).

As you can see from this PR and my previous PRs #80 #81, I'm trying to optimize some steps. This is because I need to run a pipeline on ~10000 GWAS, and since I have a HPC available, I'm not very resource constrained (so the increase in RAM usage is fine). However, all these optimization could save days (if not weeks).

Thanks for your efforts, and I'm available for any clarification or further testing (which, as always, I invite you to do on your end as well) :)

Thanks again for your great efforts! Actually, I was also thinking about how to enhance running a pipeline for multiple datasets. My initial idea was to harmonize one dataset(, imputation panel, or something) and then use the harmonized dataset as a template/reference to harmonize other datasets, which could be fast for both harmonization and assigning rsIDs. (g_SumstatsT.py; not finished though). Other ideas for this include building a hdf5 file for fast access. (util_ex_process_h5.py, currently only for assigning CHR and POS based on rsID). I think these are similar to your idea in some ways.
Since it does not change the default behavior, I would be very glad to merge your implementation for now (after reading your codes and running some tests soon).

Hi Andrea,

I test your codes with a VCF file with 3M records and it worked quite well. Great implementation!!

And I also have some thoughts that we can probably discuss before merging.

redundancy in cache

current cache structure (the information extracted by fetching the records in VCF):
```
{'X:60019:60020': [[60020, 'T', ('TA',), 0.003968250006437302]],
'X:60025:60026': [[60026, 'T', ('C',), 0.0009920640150085092]],
```
I am wondering if we can simply construct SNPID(CHR:POS:REF:ALT)-ALT_AF pairs when iterating through records in VCF to make it slightly more compact (no need to keep the complex structure).
```
{'X:60020:T:TA': 0.003968250006437302,
'X:60026:T:C': 0.0009920640150085092,
```
And then simply use SNPID and ALT_AF for checking in check_strandand check_indels.
split cache based on variant types for parallelinferstrand , we only need palindromic SNPs (A/T, C/G SNPs) and indels, which accounts for a small part of all variants in a reference VCF. But current cache contains all variants. When building cache, I am thinking maybe we can split the cache to two parts (1) non-palindromic SNPs (like xxxx.npsnp.cache) and (2) palindromic SNPs and indels (xxxx.pi.cache).
For parallelinferstrand, only part 2 is needed. Part 1 plus part 2 can be used for parallelecheckaf and paralleleinferaf in a similar way as you implemented for parallelinferstrand.
ref_alt_freq verification for cache Since one VCF may have multiple IDs in INFO, (like the allele frequencies for different ancestry), We need to verify if ref_alt_freq is indeed the ID in INFO used for building cache. Currently when cache exists, changing ref_alt_freq won't have any effect.
slightly more flexible cache format I am wondering if we can use hdf5 format instead of a dictionary pickle for cache since it is more flexible (maybe) and support data slicing. For example, we can use {ref_alt_freq}_{variant_type}_{chr} as keys for groups in HDF5 file.
- ref_alt_freq: EAS_AF, EUR_AF ... (related to point 3)
- variant_type : non-palindromic, palindromic SNPs and indels... (related to point 2)
- chr: 1, 2, ... (related to Improvement and drawbacks in your message)

In each group, the data is just a long table of SNPID(CHR:POS:REF:ALT)-ALT_AF pairs like:

SNPID AF
X:60020:T:TA 0.003968250006437302
X:60026:T:C 0.0009920640150085092

We can then just load the needed parts instead of the entire file to reduce memory usage.

cache-building process took quite a long time for a VCF with 80M records (more than 3 hours with 4 cores), maybe we can slightly improve the speed?

I would like to hear your thoughts on these points. Thanks a lot!

Hi there, I completely agree with all your thoughts! I have pushed some new commits to reflect these suggestions.

At a high level, the API is still the same, but now:

The methods responsible for building the cache have two new arguments: category and filter_fn. category specifies the group name of the cache, such as "pi" for palindromic SNPs and indels, "np" for non-palindromic SNPs, "all" for a complete cache (default: "all"). filter_fn (default: None) can be used to pass a function that takes exactly two arguments, ref and alt, and returns True whether or not the given record should be included in the cache. If category is "pi" or "np", then filter_fn is automatically set to an appropriate function (please let me know if filter_fn_pi() and filter_fn_np() are correctly checking all the needed conditions).
The methods that load the cache take a new argument category (see above).

Following your idea, the cache is now organized in a hierarchical HDF5 file, with the following structure:

.
└── ref_alt_freq (group) (e.g. EAS_AF, EUR_AF)
    └── category (group) (e.g. pi, np, all)
        ├── chrom 1 (group)
        │   ├── keys (dataset) (e.g. 1:60020:T:TA)
        │   └── values (dataset) (e.g. 0.003968250006437302)
        └── chrom 2 (group)
            ├── keys (dataset)
            └── values (dataset)

When needed, the cache is reloaded using ref_alt_freq and category as reference. Then a single dict is created, loading all keys and values from each chromosome.

This approach uses much less memory because the data structure is simpler and there are fewer records.

As for your point (5), it's weird (it took me about 30 minutes on 8 cores). Anyway, I have now migrated the cache building to multiprocessing instead of multithreading, and it is much faster on 8 cores. However, other factors could affect the performance of the building process, such as the storage on which the vcf is stored (e.g. a high performance SSD vs. HDD).

Let me know what you think of these changes!

Thanks a lot for your amazing efforts! The latest commit works well. It is much faster than before (less than 3 minutes to build the cache with 4 cores) and uses much less memory and storage. I think the main function is ready and I will merge it now. Probably we can improve some details later (updating/overwriting cache).

Thank you! Yep, we can improve something in the future, and also add cache support for parallelecheckaf and paralleleinferaf.

Also, I would be very glad if you could update the package on pypi. Thanks again!

Cloufield / gwaslab