brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

`Argument list too long' on 30k samples #37

Closed asazonov closed 4 years ago

asazonov commented 4 years ago

Hi,

Found a small corner case where the current command for running relate becomes an issue. When calculating relatedness on around 30.5k sketches, the argument list becomes too long and triggers a bash error: `Argument list too long'. My understanding is that the star syntax expands to the full list of paths, which exceeds an OS-specific maximum length.

I was able to solve this by calling the program from within the sketch path (which shortens the unfolded command):

> ./somalier relate ./somalier_sketches/*.somalier -o output

> ../somalier relate *.somalier -o output

This won't be a remedy if the number of samples is even larger. [1] suggests this is fixable by bumping up ARG_MAX and ulimit but I wasn't able to verify it yet (plus, probably not an option on every managed cluster).
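For anyone wanting to check how close a cohort is to the limit, here is a rough sketch. `getconf ARG_MAX` reports the system limit on the combined size of arguments and environment for one exec call. The sample count and average path length below are illustrative assumptions, not measurements from this run, and the estimate ignores environment variables and per-argument pointer overhead.

```shell
# Upper bound (bytes) on arguments plus environment for a single exec call.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: ${arg_max} bytes"

# Rough cost of N paths: each path plus its terminating NUL byte.
# n_samples and avg_path_len are assumed values for illustration only.
n_samples=30500
avg_path_len=60
needed=$(( n_samples * (avg_path_len + 1) ))
echo "estimated argument bytes: ${needed}"
```

Shorter paths lower the estimate, which is consistent with the workaround of running from inside the sketch directory.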

Thought I should report back, though not really a core software issue. Thank you for writing the tool, it is incredibly useful! Can confirm that somalier scales up to at least 24k samples, though most of the time is spent writing the 30GB .pairs.tsv file.

Alex

[1] https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command

brentp commented 4 years ago

i wondered when someone would hit this. i'll make it so that any file given with a .list extension is assumed to be a line-delimited list of samples. i'll also avoid writing samples that are unrelated and expected to be unrelated to the pairs file if n_samples > some threshold.

thanks for reporting.
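A minimal sketch of how such a line-delimited list could be built without ever placing all paths on one command line. The directory, placeholder files, and the name `samples.list` are hypothetical; whether a given somalier release accepts a `.list` file is based only on the plan described above and is not verified here.

```shell
# Demo directory with empty placeholder files standing in for real sketches.
sketch_dir=$(mktemp -d)
touch "$sketch_dir/a.somalier" "$sketch_dir/b.somalier" "$sketch_dir/c.somalier"

# find walks the directory and prints one path per line itself,
# so no long argument list is ever constructed by the shell.
list_file="$sketch_dir/samples.list"
find "$sketch_dir" -name '*.somalier' | sort > "$list_file"

# Count the listed sketches (tr strips padding some wc implementations add).
n_listed=$(wc -l < "$list_file" | tr -d ' ')
echo "listed ${n_listed} sketches"
```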

brentp commented 4 years ago

this is fixed in dev and here is a binary in case you want to test.

just use somalier relate ... "/path/to/*.somalier", quoting the glob so the shell does not expand it.
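A small sketch of what quoting changes from the shell's side. `count_args` is a hypothetical helper that stands in for any program and only reports how many arguments it received; the placeholder files are created just for the demo.

```shell
# count_args is a hypothetical stand-in for a program receiving arguments.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a.somalier" "$demo_dir/b.somalier" "$demo_dir/c.somalier"

# Unquoted: the shell expands the glob into one argument per matching file.
unquoted=$(count_args "$demo_dir"/*.somalier)

# Quoted: the pattern is passed through as a single literal argument,
# leaving any expansion to the receiving program.
quoted=$(count_args "$demo_dir/*.somalier")

echo "unquoted: ${unquoted} args, quoted: ${quoted} arg"
```

With the quoted form the argument list stays one pattern long regardless of how many sketches match.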

somalier.gz

edit: I also updated it to sub-sample the pairs.tsv file in large cohorts. you are right that this does make somalier even faster and makes the pairs output more manageable.