brian-cleary / LatentStrainAnalysis

Partitioning and analysis methods for large, complex sequence datasets
MIT License

error during HashCounting.sh with large files (w/o LSFS) #11

Open noriko-cassman opened 8 years ago

noriko-cassman commented 8 years ago

Hello! Thanks for the nice paper.

The test data ran fine on our cluster using the bash scripts, as did 10K subsets of my 18 samples using modified versions of the same scripts. However, when I ran LSA on the full files (18 samples, each about 500 MB), I got errors during HashCounting.sh, even when running with up to 40 threads. I am not using the LSFS system.

Here is the error message:

parallel: This job failed:
echo $(date) writing k-mer corpus for file 2; \
python LSA/kmer_corpus.py -r 2 -i Vhashed_reads/ -o Vcluster_vectors/ >> VLogs/KmerCorpus.log 2>&1
printing end of last log file...
    hashobject.kmer_corpus_to_disk(Kmer_Hash_Count_Files[fr],mask=M)
IndexError: list index out of range
Traceback (most recent call last):
  File "LSA/kmer_corpus.py", line 33, in <module>
    hashobject.kmer_corpus_to_disk(Kmer_Hash_Count_Files[fr],mask=M)
IndexError: list index out of range
Traceback (most recent call last):
  File "LSA/kmer_corpus.py", line 33, in <module>
    hashobject.kmer_corpus_to_disk(Kmer_Hash_Count_Files[fr],mask=M)
IndexError: list index out of range
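An IndexError at Kmer_Hash_Count_Files[fr] means the index fr falls outside the list, so the list holds fewer files than the task number expects. Below is a minimal sketch of a check that could be run before the k-mer corpus step. It is not part of LSA: the function name check_task_index and the default file pattern are hypothetical, and it assumes the list is built by globbing the input directory and that the -r value is used directly as a zero-based index. Both assumptions should be verified against LSA/kmer_corpus.py.

```python
import glob
import os


def check_task_index(input_dir, task_rank, pattern="*.count.hash"):
    """Report whether task_rank is a valid index into the matching files.

    `pattern` is an assumed file pattern, not taken from LSA; adjust it to
    whatever LSA/kmer_corpus.py actually globs for.
    Returns a tuple (ok, n_files).
    """
    # Collect the candidate files in a stable order.
    files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    # The index is valid only if it lies in [0, len(files)).
    ok = 0 <= task_rank < len(files)
    return ok, len(files)


if __name__ == "__main__":
    # Paths and rank mirror the failing command in the error message.
    ok, n_files = check_task_index("Vhashed_reads/", 2)
    print("index valid:", ok, "| matching files found:", n_files)
```

If the check reports fewer matching files than the highest task number passed with -r, the shortfall probably originates in an earlier step, and the logs of that step are the place to look.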

Oddly, when I looked at the log files for the test data and for my subset data, I found errors similar to those from the full data (attached below). After looking them up, I thought they might be related to this: http://stackoverflow.com/questions/4964101/pep-3118-warning-when-using-ctypes-array-as-numpy-array.

Here are the outputs you requested in other issues, from the run with the full dataset: HashReads.log KmerCorpus.log CombineFractions.log MergeHash.log

Note: GlobalWeights.log and CreateHash.log were empty.

ls -l Vcluster_vectors.txt ls -l Vhashed_reads.txt ls -l Voriginal_reads.txt

Here are the log files and outputs from the run with your test data: CombineFractions.log CreateHash.log HashReads.log KmerClusterIndex.log KmerCorpus.log KmerLSI.log MergeIntermediatePartitions.log ReadPartitions.log

Note: these were empty: GlobalWeights.log KmerClusterCols.log KmerClusterMerge.log KmerClusterParts.log MergeHash.log

ls -l cluster_vectors ls -l hashed_reads ls -l original_reads

Thanks in advance, Nori