Closed Fedster closed 8 years ago
Hi Fedster, I can't tell what the problem might be without seeing the output or knowing more about your HPC setup. Your params file looks fine to me, though. In general 6 samples of that size should take just several hours, not several days. Perhaps try running just a single step first to make sure you have the software installed properly. Look for the intermediate files generated by each step (explained in the tutorial) to see whether it is running at all.
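To make "look for the intermediate files generated by each step" concrete, here is a small helper that checks which of the usual pyRAD v3 output directories exist yet in the working directory. The directory names are the ones the tutorial describes (`clust.88` reflects the Wclust=.88 setting on params line 10); treat them as assumptions if your params differ — this is a quick diagnostic sketch, not part of pyRAD itself.

```python
import os

# Typical pyRAD v3 output directories, created in the working
# directory as the corresponding step completes (assumed names
# from the tutorial; "clust.88" depends on your Wclust value).
STEP_DIRS = {
    "s2 (editing)": "edits",
    "s3 (clustering)": "clust.88",
    "s5 (consensus stats)": "stats",
    "s7 (final output)": "outfiles",
}

def report_progress(workdir):
    """Map each step label to whether its output directory
    exists yet under workdir."""
    return {step: os.path.isdir(os.path.join(workdir, d))
            for step, d in STEP_DIRS.items()}
```

Running this between steps shows at a glance whether a long-running job has actually produced anything.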
Hi Deren,
the cluster is this one: https://research.csc.fi/taito-user-guide
Specifically I am using the longmem queue which has 16/24 cores for each node and runs on a Sandy Bridge / Haswell node (SB for 16 cores, H for 24).
I can ask more detailed info if needed. I have no output yet, but one question: in the directory where the files reside there is another directory and I just noticed this traceback:
------------------------------------------------------------
pyRAD : RADseq for phylogenetics & introgression analyses
------------------------------------------------------------
sorted .fastq from /wrk/me/HR/HR-parents/ being used
step 2: editing raw reads
Process Worker-7:
Traceback (most recent call last):
  File "/appl/opt/python/2.7.10-gcc493-shared/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/homeappl/home/me/appl_taito/pyrad/pyrad/potpour.py", line 31, in run
    res = self.func(*job)
  File "/homeappl/home/me/appl_taito/pyrad/pyrad/editraw_rads.py", line 88, in rawedit
    f = open(infile,'r')
IOError: [Errno 21] Is a directory: '/wrk/me/HR/HR-parents/HR'
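The `IOError: [Errno 21] Is a directory` is consistent with the input list being built from a pattern like `dir/*`, which matches subdirectories as well as files, so `open()` eventually hits the stray `HR` subdirectory. A minimal sketch of a defensive fix — illustrative only, not pyRAD's actual code; `list_fastqs` is a hypothetical helper:

```python
import glob
import os

def list_fastqs(data_dir):
    """Collect candidate inputs the way a naive glob would,
    then drop anything that is not a regular file."""
    candidates = glob.glob(os.path.join(data_dir, "*"))
    # os.path.isfile() is False for directories, so a stray
    # subdirectory like 'HR' is skipped instead of opened
    return sorted(p for p in candidates if os.path.isfile(p))
```

With a filter like this, a subdirectory sitting next to the data would be ignored rather than crashing the worker.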
As a test I moved the directory /wrk/me/HR/HR-parents/HR elsewhere, so that /wrk/me/HR/HR-parents/ contains nothing but the data needed by pyrad. That may have done the trick, because now more of the pipeline is running. I will report back (though the fact that the data directory must not contain anything else is pretty puzzling).
Could it be related to the "unorthodox" way you are running pyRAD, as was mentioned in #26?
Add a wild-card selector (*) to the end of your param 18 entry so that it is indicating the path to the fastq files in the directory rather than simply to a directory.
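Applied to the path from this thread, params line 18 would then look something like the following (the `*.fastq.gz` pattern is one illustrative choice; it also has the side benefit of not matching subdirectories such as the stray `HR` directory):

```
/wrk/me/HR/HR-parents/*.fastq.gz  ## 18.opt.: loc. of de-multiplexed data (s2)
```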
I doubt it has anything to do with how I invoke pyrad, since everything is now running fine. I will test Deren's suggestion later, but making sure the data directory contains nothing but the data worked.
Hi,
I am running pyrad on a grand total of 6 samples, and it has been running for almost 2 weeks with little to show for it. Specifically, I am running it on an HPC cluster with this batch script:
#!/bin/bash -l
#SBATCH -J py_fl
#SBATCH -o py_fl.out
#SBATCH -e py_fl.err
#SBATCH -t 14-00:00:0
#SBATCH -n 1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH -p longrun
#SBATCH --mem-per-cpu=5600

module load muscle/3.8.31
module load gcc/4.9.3 intelmpi/5.1.1 python/2.7.10

python ~/appl_taito/pyrad/pyrad/pyRAD.py -p /homeappl/home/me/HR-SNPs/params.txt -s 234567
The input files are quite large:
649M  HB-F.fastq.gz
562M  HB-M.fastq.gz
1.1G  HP-F.fastq.gz
904M  HP-M.fastq.gz
2.5G  HR-F.fastq.gz
2.5G  HR-M.fastq.gz
==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
/homeappl/home/me/HR-SNPs/                        ## 1. Working directory (all)
                                                  ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)
/homeappl/home/me/appl_taito/vsearch/bin/vsearch  ## 4. command (or path) to call vsearch (or usearch) (s3,s6)
/appl/bio/muscle/bin/muscle                       ## 5. command (or path) to call muscle (s3,s7)
TGCAG                                             ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2)
16                                                ## 7. N processors (parallel) (all)
5                                                 ## 8. Mindepth: min coverage for a cluster (s4,s5)
0                                                 ## 9. NQual: max # sites with qual < 20 (or see line 20) (s2)
.88                                               ## 10. Wclust: clustering threshold as a decimal (s3,s6)
rad                                               ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs) (all)
4                                                 ## 12. MinCov: min samples in a final locus (s7)
6                                                 ## 13. MaxSH: max inds with shared hetero site (s7)
hrsnps                                            ## 14. Prefix name for final output (no spaces) (s7)
==== optional params below this line ===================================  affected step ==
                        ## 15.opt.: select subset (prefix* only selector) (s2-s7)
/wrk/me/HR/HR-parents/  ## 18.opt.: loc. of de-multiplexed data (s2)
                        ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1)
1                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2)
                        ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
0                       ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5)
6                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5)
2                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
5                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7)
                        ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
0,1                     ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                        ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
1                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6)
                        ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6)
16                      ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
I am at a loss to understand what is going on and why pyrad is taking so long -- especially because my HPC runs are limited to 14 days.