faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

phasing workflow output question #255

Closed Motikuko closed 3 years ago

Motikuko commented 3 years ago

Hello Dr. Faircloth, I am running the phasing workflow from phyluce 1.7 to get SNP data for popgen analysis, and it is taking days to finish. For some samples vcf.0 and vcf.1 files are created before the job times out. I was wondering if the workflow is complete for those samples, so that I can exclude them when I restart the job. Thank you.

brantfaircloth commented 3 years ago

Howdy,

If this is taking days to finish for a single sample, something sounds wrong. It might be that the machine you are running the analysis on needs additional CPUs or it might be that the machine you are running on is RAM limited (pretty likely).

Snakemake should track which jobs have completed, so if you need to restart a job, the job should start in the correct place (meaning it will not re-run samples that have finished).

Motikuko commented 3 years ago

Thank you, Dr. Faircloth. I restarted the job with a new output name (output_2), because otherwise I would get an error. But I am noticing that the bam and fasta folders in output_2 have the same samples as in output_1. I am wondering if I should have added a flag when I restarted the job, so that Snakemake knows to continue the job and append the files to output_1 instead of re-running all the samples? Any suggestions would be appreciated! Moti.

brantfaircloth commented 3 years ago

If you move the output to a new folder, the code will start over again (because it does not recognize what's been done and not done). That said, there's a bug in the code - you need to restart the snakemake job... but there is not an easy way for you to pass the --restart flag to snakemake (that is an oversight on my part).

Unfortunately, I'm not sure when I can get around to fixing this (I work on phyluce in my free time). I'll make a note to repair it at some point.

That still doesn't help your job complete, however... I would try restarting the job on a larger machine (more CPUs and/or RAM) and seeing where that gets you.

Motikuko commented 3 years ago

I will do that for sure. Thank you! One last question, if you don't mind: how would you recommend merging the vcf.0 and vcf.1 files and moving all samples into one .vcf file with the complete phased data for further popgen analysis? I have been searching, but I can't find a clear solution for this particular situation with samtools. Thank you so much for always replying to our inquiries. We appreciate you taking the time =) Moti.

brantfaircloth commented 3 years ago

If you want phased SNPs, one option is to use samtools phase with the BAM files produced during the mapping stage. Or, you can take a different route and use GATK with your samples (and read-backed phasing). You would need to do both of those manually (e.g. the phasing parts). If you want sequences representing each putative phase for alignment and analysis, you can convert each vcf file to a sequence representing each haplotype with a tool like vcftophylip.
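
As a rough illustration of the manual samtools route mentioned above, here is a minimal Python sketch that builds one `samtools phase` command per BAM file. It is not part of phyluce. The function names and the directory names in the usage comment are hypothetical, and the sketch assumes coordinate-sorted BAMs and a `samtools` binary on the PATH. The `-b` option sets the output prefix, so each sample should end up with `<prefix>.0.bam`, `<prefix>.1.bam`, and `<prefix>.chimera.bam`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_phase_commands(bam_dir: str, out_dir: str) -> List[List[str]]:
    """Return one `samtools phase` command per *.bam file in bam_dir.

    Both directory arguments are placeholders chosen by the caller;
    nothing here depends on the phyluce output layout.
    """
    commands = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        # Use the BAM file name (without ".bam") as the output prefix.
        prefix = Path(out_dir) / bam.stem
        commands.append(["samtools", "phase", "-b", str(prefix), str(bam)])
    return commands


def run_phase_commands(commands: List[List[str]]) -> None:
    """Run each command in turn and stop on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example usage (hypothetical paths; the output directory must already exist):
# run_phase_commands(build_phase_commands("mapping/bams", "phased"))
```

Building the commands separately from running them makes it easy to print and inspect them before anything is executed.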

Motikuko commented 3 years ago

Thank you so much for the advice, Dr. Faircloth !!! I appreciate all of your help! Moti.