quick aggregation of counts and support option from 4-8h to minutes

epigen / atacseq_pipeline

Ultimate ATAC-seq Data Processing, Quantification and Annotation Snakemake Workflow and MrBiomics Module.

https://epigen.github.io/atacseq_pipeline/

MIT License

44 stars 2 forks

quick aggregation of counts and support option from 4-8h to minutes #6

Closed (sreichl closed this issue 11 months ago)

sreichl commented 1 year ago

[x] check if the order is for sure always the same
[x] think of alternatives
[x] speed up aggregation by using a bash two-liner for aggregate_counts and aggregate_support; test it first and compare the result to my all_counts file (datamash can be installed with conda)
```
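# Merge: keep the header of the first file only, then append the data rows of every file.
# ${allFiles} is left unquoted on purpose so that it expands to one argument per file.
awk 'FNR==1 && NR>1 {next} {print}' ${allFiles} > "${mergeFile}"
# Transpose the merged comma-separated table
datamash -t ',' transpose < "${mergeFile}" > "${mergeFileTranspose}"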
```
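For the first checkbox (is the order always the same?), one possible pre-check is sketched below. It assumes each per-sample file is a comma-separated table whose first line lists the features, which is what the merge step relies on when it drops the later headers. The function name `check_same_header` is made up for illustration and is not part of the pipeline.

```shell
# Hypothetical helper: succeed only if every file given as an argument
# has exactly the same header line as the first file.
check_same_header() {
    local first_header header f
    first_header=$(head -n 1 "$1")
    for f in "$@"; do
        header=$(head -n 1 "$f")
        if [ "$header" != "$first_header" ]; then
            # Name the offending file so it can be inspected by hand
            echo "header mismatch in $f" >&2
            return 1
        fi
    done
}
```

It could be run right before the merge, e.g. `check_same_header ${allFiles} && awk ...`, so that a sample with a different feature order stops the aggregation instead of being concatenated silently.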

old notes

add a config parameter for quick aggregation of counts and support → either simply concatenate without checking the dimensions/features, or do a simple manual check and throw an error if one sample does not match
update atacseq_analysis.yaml to include pandas version 1.1.4 (because it is much faster than newer versions) & test before committing the changes
pandas 1.3.0 (and the newest, pandas 1.3.2) is extremely slow compared to pandas 1.1.4 → look for a ticket or issue on GitHub that reports this
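A minimal sketch of the "check and throw an error" variant of quick aggregation, assuming plain comma-separated per-sample files with a header line and no quoted commas inside fields. The function name `merge_checked` is hypothetical. It concatenates in a single awk pass and aborts as soon as a line has a different number of columns than the first header.

```shell
# Hypothetical helper: concatenate per-sample CSV files, keeping only the
# first header, and fail if any line's column count differs from it.
merge_checked() {
    awk -F',' '
        NR == 1 { ncol = NF }            # first header: record the expected column count
        NF != ncol {                     # dimension mismatch: report the file and abort
            printf "%s: expected %d columns, found %d\n", FILENAME, ncol, NF > "/dev/stderr"
            exit 1
        }
        FNR == 1 && NR > 1 { next }      # skip the header of every later file
        { print }
    ' "$@"
}
```

A config switch could then choose between this checked merge and the plain concatenation. Note that the check only compares column counts, not feature names or their order.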

sreichl commented 1 year ago

add a mechanism to configure different queue and memory settings for long jobs, e.g., the aggregate steps (maybe not relevant anymore if aggregation is sped up significantly)