dereneaton / pyrad

Assembly and analysis of RADseq data sets
GNU General Public License v3.0
27 stars 16 forks source link

pyrad really slow? #28

Closed Fedster closed 8 years ago

Fedster commented 8 years ago

Hi,

I am running pyrad on a grand total of 6 samples, and it has been running for almost 2 weeks with little to show for. Specifically I am running it on a HPC cluster with this bash shell:

!/bin/bash -l

SBATCH -J py_fl

SBATCH -o py_fl.out

SBATCH -e py_fl.err

SBATCH -t 14-00:00:0

SBATCH -n 1

SBATCH --nodes=1

SBATCH --cpus-per-task=16

SBATCH -p longrun

SBATCH --mem-per-cpu=5600

module load muscle/3.8.31 module load gcc/4.9.3 intelmpi/5.1.1 python/2.7.10

python ~/appl_taito/pyrad/pyrad/pyRAD.py -p /homeappl/home/me/HR-SNPs/params.txt -s 234567

The files it inputting are quite large:

649M HB-F.fastq.gz 562M HB-M.fastq.gz 1.1G HP-F.fastq.gz 904M HP-M.fastq.gz 2.5G HR-F.fastq.gz 2.5G HR-M.fastq.gz

==* parameter inputs for pyRAD version 3.0.63 _======================== affected step == /homeappl/home/me/HR-SNPs/ ## 1. Working directory (all)

2. Loc. of non-demultiplexed files (if not line 18) (s1)

                   ## 3. Loc. of barcode file (if not line 18)             (s1)

/homeappl/home/me/appl_taito/vsearch/bin/vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) /appl/bio/muscle/bin/muscle ## 5. command (or path) to call muscle (s3,s7) TGCAG ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2) 16 ## 7. N processors (parallel) (all) 5 ## 8. Mindepth: min coverage for a cluster (s4,s5) 0 ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2) .88 ## 10. Wclust: clustering threshold as a decimal (s3,s6) rad ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all) 4 ## 12. MinCov: min samples in a final locus (s7) 6 ## 13. MaxSH: max inds with shared hetero site (s7) hrsnps ## 14. Prefix name for final output (no spaces) (s7) ==== optional params below this line =================================== affected step ==

15.opt.: select subset (prefix_ only selector) (s2-s7)

                   ## 16.opt.: add-on (outgroup) taxa (list or prefix_)         (s6,s7)
                   ## 17.opt.: exclude taxa (list or prefix_)                   (s7)

/wrk/me/HR/HR-parents/ ## 18.opt.: loc. of de-multiplexed data (s2)

19.opt.: maxM: N mismatches in barcodes (def= 1) (s1)

                   ## 20.opt.: phred Qscore offset (def= 33)                    (s2)

1 ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2)

22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)

0 ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) 6 ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) 2 ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) 5 ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7)

27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)

                   ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)

0,1 ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)

I am at loss to understand what is going on and why pyrad is taking so long -- especially because my HPC runs are limited to 14 days.

dereneaton commented 8 years ago

Hi Fedster, I can't tell what the problem might be without seeing the output or knowing more about your HPC setup. Your params file looks fine to me, though. In general 6 samples of that size should take just several hours, not several days. Perhaps try running just a single step first to make sure you have the software installed properly. Look for the intermediate files generated by each step (explained in the tutorial) to see whether it is running at all.

Fedster commented 8 years ago

Hi Deren,

the cluster is this one: https://research.csc.fi/taito-user-guide

Specifically I am using the longmem queue which has 16/24 cores for each node and runs on a Sandy Bridge / Haswell node (SB for 16 cores, H for 24).

I can ask more detailed info if needed. I have no output yet, but one question: in the directory where the files reside there is another directory and I just noticed this traceback:

 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

sorted .fastq from /wrk/me/HR/HR-parents/ being used
step 2: editing raw reads
Process Worker-7:

Traceback (most recent call last): File "/appl/opt/python/2.7.10-gcc493-shared/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/homeappl/home/me/appl_taito/pyrad/pyrad/potpour.py", line 31, in run res = self.func(*job) File "/homeappl/home/me/appl_taito/pyrad/pyrad/editraw_rads.py", line 88, in rawedit f = open(infile,'r') IOError: [Errno 21] Is a directory: '/wrk/me/HR/HR-parents/HR'

Fedster commented 8 years ago

As a test I moved the directory /wrk/me/HR/HR-parents/HR elsewhere so that in /wrk/me/HR/HR-parents/ there nothing but the data needed by pyrad. That might be the trick because now I am getting more stuff running. I will report back (though the fact that the dir containing the data should not contain other stuff is pretty puzzling).

StuntsPT commented 8 years ago

Could it be related with the "unorthodox" way you are running pyRAD? As was mentioned in #26 ?

dereneaton commented 8 years ago

Add a wild-card selector (*) to the end of your param 18 entry so that it is indicating the path to the fastq files in the directory rather than simply to a directory.

Fedster commented 8 years ago

I doubt it has any relationship with how I use pyrad, since now everything is running ok -- I will test Deren' suggestion later but making sure the directory where the data is contains nothing but the data worked.