faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Phasing SNPs fasta files #315

Open · nicolashazzi23 opened this issue 9 months ago

nicolashazzi23 commented 9 months ago

Hi, I am trying to run the phasing workflow with the BAM files generated previously by the mapping workflow. However, I am not getting the fasta files in the results; I only get BAM files (see attached image). I am attaching the conf file in case it helps clarify whether I am doing something wrong.

My final question is about how to construct the final SNP matrix. The tutorial says this about the post-phasing step: "You can essentially group all the .0.fasta and .1.fasta files for all taxa together as new “assemblies” of data and start the phyluce analysis process over from phyluce_assembly_match_contigs_to_probes." I find this a bit ambiguous: should I create one contigs folder with the .0.fasta files and a second folder with the .1.fasta files, or a single folder with both sets of files together? And how do I extract the SNPs at the end?

This is the command that I ran:

phyluce_workflow --config bams_2.conf \
    --output phasing_3 \
    --workflow phasing \
    --cores 1 phyluce1 phasing.txt
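For context, the conf file follows the two-section layout from the workflow docs, mapping each taxon to its BAM and to its contigs from assembly; schematically (these are placeholder paths, not my real ones):

    # schematic only: real taxon names and paths differ
    bams:
        HW_0458: "/path/to/bams/HW_0458.bam"
    contigs:
        HW_0458: "/path/to/contigs/HW_0458.contigs.fasta"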

brantfaircloth commented 9 months ago

You will likely need to check the log files output by samtools/bamtools. The attached image is somewhat hard to see, but it looks like some files are not being created properly. Once you have the fasta files, they can both go in the same folder. Extracting SNPs is up to you: depending on your needs, you can choose where to harvest the SNP calls from, or you can additionally alter/update the files produced so that they output SNPs.
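Something like the following would work for the grouping step (a minimal sketch; the directory names are placeholders for however your run is laid out):

    # gather both phased allele fastas for every taxon into one folder,
    # which becomes the new "assemblies" input for the downstream phyluce steps
    mkdir -p phased-assemblies
    cp phasing_3/fastas/*.0.fasta phasing_3/fastas/*.1.fasta phased-assemblies/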

nicolashazzi23 commented 9 months ago

Hi Brant, thank you very much for your help. Here are the last 50 lines of the run, showing the error. It seems to be a memory error, but I ran the job with the maximum capacity of our SLURM cluster ("highMem – nodes in this category have large memory – 3 TB – and are for jobs that require more memory"), and I still get this error.

thanks!

Finished processing comp53472_c0_seq1:1-252
Processing comp53486_c0_seq1:1-333
bam bams/HW_0458.0.bam:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
        at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
        at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
        at java.base/java.lang.StringBuilder.append(StringBuilder.java:174)
        at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:208)
        at org.broadinstitute.pilon.BamFile.reader(BamFile.scala:51)
        at org.broadinstitute.pilon.BamFile.process(BamFile.scala:116)
        at org.broadinstitute.pilon.GenomeRegion.processBam(GenomeRegion.scala:292)
        at org.broadinstitute.pilon.GenomeFile.$anonfun$processRegions$5(GenomeFile.scala:112)
        at org.broadinstitute.pilon.GenomeFile.$anonfun$processRegions$5$adapted(GenomeFile.scala:112)
        at org.broadinstitute.pilon.GenomeFile$$Lambda$48/0x00000001001a5840.apply(Unknown Source)
        at scala.collection.immutable.List.foreach(List.scala:388)
        at org.broadinstitute.pilon.GenomeFile.$anonfun$processRegions$4(GenomeFile.scala:112)
        at org.broadinstitute.pilon.GenomeFile.$anonfun$processRegions$4$adapted(GenomeFile.scala:109)
        at org.broadinstitute.pilon.GenomeFile$$Lambda$44/0x00000001001a0040.apply(Unknown Source)
        at scala.collection.Iterator.foreach(Iterator.scala:937)
        at scala.collection.Iterator.foreach$(Iterator.scala:937)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
        at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:970)
        at scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:49)
        at scala.collection.parallel.Task$$Lambda$45/0x00000001001a7840.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:63)
        at scala.collection.parallel.Task.tryLeaf(Tasks.scala:52)
        at scala.collection.parallel.Task.tryLeaf$(Tasks.scala:46)
        at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:967)

[Sun Sep 24 12:49:54 2023]
Error in rule pilon_allele_0:
    jobid: 381
    output: fastas/HW_0458.0.fasta
    shell:
        pilon --threads 1 --vcf --changes --fix snps,indels --minqual 10 --mindepth 5 --genome /lustre/groups/hormigalab/NHazzi/wood/contigs/HW_0458.contigs.fasta --bam bams/HW_0458.0.bam --outdir fastas --output HW_0458.0
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

brantfaircloth commented 9 months ago

Looks like you need to feed pilon some more RAM (running it on a large node is usually not, by itself, enough). This should only require modifying the workflow script in the pilon section here so that it reads like:

pilon -jar -Xmx256G --threads {threads}...

where you'll change 256G to something that works for your HPC. This sets the maximum RAM pilon can use (by default, it is 1 GB).
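For example (a sketch only: the memory value and any {input.*}/{params.*} names below are illustrative placeholders, not the exact variables in the workflow file, which you should keep as-is):

    # illustrative sketch of the pilon call from the error log with -Xmx added;
    # keep the workflow's own snakemake placeholders; the names here are hypothetical
    pilon -jar -Xmx256G --threads {threads} --vcf --changes --fix snps,indels \
        --minqual 10 --mindepth 5 \
        --genome {input.contigs} --bam {input.bam} \
        --outdir fastas --output {params.sample}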

nicolashazzi23 commented 9 months ago

Hi Brant, thank you very much! It worked thanks to your suggestion!

brantfaircloth commented 9 months ago

Excellent 👍

nicolashazzi23 commented 9 months ago

Hi Brant, sorry to bother you again, but I would like to ask once more about the .0.fasta and .1.fasta files generated by the phasing process. I want to estimate species trees and also get SNPs for a STRUCTURE analysis. When I put the .0.fasta and .1.fasta files in the same folder, as you suggested, and ran phyluce_assembly_match_contigs_to_probes, I got the following error: "sqlite3.OperationalError: duplicate column name: HW_0302". Should I merge the .0.fasta and .1.fasta files using cat? Or what should I do with the .0.fasta and .1.fasta files after phasing? Thanks in advance!
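In case it helps diagnose this: my guess is that both allele files resolve to the same taxon name when phyluce derives names from the filenames. A workaround I was considering (only a guess on my part, not something from the docs) is renaming the alleles to distinct names before matching:

    # guess: if HW_0302.0.fasta and HW_0302.1.fasta both collapse to the taxon
    # name "HW_0302", renaming with underscores keeps the two alleles distinct
    cd phased-assemblies
    for f in *.0.fasta; do mv "$f" "${f%.0.fasta}_0.fasta"; done
    for f in *.1.fasta; do mv "$f" "${f%.1.fasta}_1.fasta"; done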