EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code

A cryptic error message: "external sort of primary keys failed" #69

Closed: Khalimat closed this issue 1 year ago

Khalimat commented 1 year ago

Hi Sean,

Thank you for the library!

I am getting a very cryptic message when I try to index a protein FASTA file from IMG_VR v.4:

Failed to write keys to ssi file [some_path]:
  external sort of primary keys failed

One assumption I had was that there were some repeated keys, but I checked and all of them are unique, so I would be very grateful for any advice on how to sort this out.

I am using Rocky Linux.

npcarter commented 1 year ago

Could you send the specific file name, or put it somewhere where I can get it from you? It looks like there’s a lot of data on that site.

Khalimat commented 1 year ago

Hi Nick,

Thank you so much!

Yup, I meant the file which contains all protein sequences:

curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/IMG_VR/download/_JAMO/63a22c8a3b5d0133c73fb0a2/IMGVR_all_proteins-high_confidence.faa.gz' -b cookies > IMGVR_all_proteins-high_confidence.faa.gz

Sorry, it seems the problem was on my HPC side: I resubmitted the same job and it worked.

cryptogenomicon commented 1 year ago

Likely a disk space limitation for tmp files. Check that you have ${TMPDIR} pointing somewhere that has a bunch of space, enough for building the index. For large databases, when the index is too large to sort in RAM, it switches to using a tmpfile and doing an on-disk sort.
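
A quick way to check this before rerunning the indexing job might look like the sketch below. It only reports where temporary files would go and how much space is free there. The `/tmp` fallback and the `free_kb` helper name are assumptions made for illustration, not part of Easel, and the scratch path in the last comment is hypothetical.

```shell
# Directory used for temporary files; assume /tmp when TMPDIR is unset.
tmp_dir="${TMPDIR:-/tmp}"

# Print the free space, in kilobytes, on the filesystem holding a directory.
# df -P forces one line per filesystem; -k reports 1024-byte blocks.
# free_kb is a hypothetical helper name, not an Easel command.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "Temporary directory: $tmp_dir"
echo "Free space (KB): $(free_kb "$tmp_dir")"

# To use a larger scratch area for the job (hypothetical path):
# export TMPDIR=/path/to/scratch
```

If the reported free space is small compared with the size of the index being built, exporting `TMPDIR` to a larger filesystem in the job script before indexing is the change suggested above.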

Khalimat commented 1 year ago

Thank you! That is useful to know!