faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Obtaining a final .vcf file with ambiguities for diploid organisms #285

Open marianamazzochi opened 1 year ago

marianamazzochi commented 1 year ago

Dear Brant, my project involves 6 populations of a seabird species, and I aim to estimate their genetic structure and other parameters. I used your pipeline to trim and assemble my fastq data, as we discussed before. However, I have now realized that the data I obtained look haploid: each individual has a single sequence without any ambiguities. I don't know if I missed some part of the tutorial, but I couldn't find out how to keep the ambiguities (or obtain two sequences per individual, with all alleles for that individual). Could you please help me?

Cheers,

brantfaircloth commented 1 year ago

This is somewhat out of the scope of bug reports about the Phyluce software. There are also lots of options you could pursue that are specific to your project and what you are trying to do with your data. Because you have fasta files for a set of individuals now, you can select one individual as your "reference" fasta and call SNPs for the individuals in each population against that reference. One way that we do this in my lab is detailed here: http://protocols.faircloth-lab.org/en/latest/protocols-computer/analysis/analysis-gatk-parallel.html. That said, there are many different ways to do the same sorts of things...
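For readers who want a rough picture of what "call SNPs against one individual as a reference" can look like, here is a minimal sketch. It is not the GATK protocol linked above: it swaps in bwa, samtools, and bcftools, which are assumed to be installed separately. The file names (`reference-individual.fasta`, `sample1-READ1.fastq.gz`, and so on) and the `run` and `call_snps` helpers are hypothetical. By default the script only prints the commands it would run, so the steps can be inspected before anything is executed.

```shell
#!/bin/bash
# Sketch only: sample names and file paths are hypothetical, and the tools
# (bwa, samtools, bcftools) are assumptions, not part of the linked protocol.

# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# call_snps REFERENCE SAMPLE READ1 READ2
call_snps() {
    local ref="$1" sample="$2" r1="$3" r2="$4"
    # Map the cleaned reads of one individual against the chosen reference.
    run bwa mem -o "${sample}.sam" "$ref" "$r1" "$r2"
    # Sort and index the alignments.
    run samtools sort -o "${sample}.bam" "${sample}.sam"
    run samtools index "${sample}.bam"
    # Pile up the reads, then keep variant sites only (-v) using the
    # multiallelic caller (-m); heterozygous sites appear as 0/1 genotypes.
    run bcftools mpileup -f "$ref" -Ou -o "${sample}.bcf" "${sample}.bam"
    run bcftools call -m -v -Oz -o "${sample}.vcf.gz" "${sample}.bcf"
}

# Index the reference once, then process each individual in turn.
run bwa index reference-individual.fasta
call_snps reference-individual.fasta sample1 sample1-READ1.fastq.gz sample1-READ2.fastq.gz
```

Setting `DRY_RUN=0` would execute the commands instead of printing them. Filtering thresholds, joint calling across individuals, and duplicate marking are left out on purpose, since those choices depend on the project.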

marianamazzochi commented 1 year ago

Thanks, Brant. I am following the pipeline you mentioned, but I think the trimming part contains a bug: the 'module' command is not working. I tried to google some solutions, and it appears that 'module' no longer works because the set_shell_startup configuration option is now disabled by default in this environment-modules update. Actually, I can't even install environment-modules; the attempt outputs 'Invalid operation'.

So, I am stuck at this part of the pipeline:


    #!/bin/bash
    #PBS -A
    #PBS -l nodes=2:ppn=20
    #PBS -l walltime=2:00:00
    #PBS -q checkpt
    #PBS -N multi_trimmomatic

    # SET THE NUMBER of Cores per job (needs to be multiple of 2)
    export CORES_PER_JOB=4

    # DONT EDIT BELOW

    # We need java to run trimmomatic
    module load jdk/1.8.0_161
    module load gnuparallel/20170122

    # move into the directory containing this script
    cd $PBS_O_WORKDIR

    # automatically set the number of Jobs per node based on $CORES_PER_JOB
    export JOBS_PER_NODE=$(($PBS_NUM_PPN / $CORES_PER_JOB))

    parallel --colsep '\,' \
        --progress \
        --joblog logfile.trimmomatic.$PBS_JOBID \
        -j $JOBS_PER_NODE \
        --slf $PBS_NODEFILE \
        --workdir $PBS_O_WORKDIR \
        -a files-to-trim.txt \
        ./trimmomatic-sub.sh {$1} {$2}


Do you know a different way to call java and parallel that replaces the 'module' command? Thanks again,

brantfaircloth commented 1 year ago

Howdy,

These are just examples of how you might go about running these types of analyses - they are written for our particular HPC environment. As a result, they'll need to be modified for your particular environment in order to run correctly (you should also have all the trimmomatic parts already run from the phyluce pipeline).

The important commands to focus on are those running particular programs, usually at the bottom of each script that I sent you. You should be able to modify these to work with whichever environment you are using for analysis.

-b
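As one illustration of adapting the script to a machine without the `module` command, the sketch below checks that `java` and `parallel` are already on `PATH` and derives the number of jobs without relying on PBS variables. It assumes both programs were installed some other way (for example through a package manager or conda); `require_tool` and `jobs_per_node` are hypothetical helper names, not part of phyluce or the lab protocol.

```shell
#!/bin/bash
# Sketch only: assumes java and GNU parallel are installed without
# environment-modules and are already on PATH.

# Succeed if a program is on PATH; print a message to stderr otherwise.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing: $1 (install it or add it to PATH)" >&2
        return 1
    fi
}

# jobs_per_node TOTAL_CORES CORES_PER_JOB
# Same integer division as the PBS script, but never returns less than 1.
jobs_per_node() {
    local jobs=$(( $1 / $2 ))
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs"
}

# Outside PBS, $PBS_NUM_PPN is unset, so fall back to the local core count.
TOTAL_CORES="${PBS_NUM_PPN:-$(getconf _NPROCESSORS_ONLN)}"
CORES_PER_JOB=4
JOBS_PER_NODE="$(jobs_per_node "$TOTAL_CORES" "$CORES_PER_JOB")"

for tool in java parallel; do
    require_tool "$tool" || echo "the trimming step would fail without $tool" >&2
done
echo "would run $JOBS_PER_NODE jobs of $CORES_PER_JOB cores each"
```

On a single workstation, GNU parallel runs jobs locally when no `--slf` login file is given, so the PBS-specific options in the original command would likely be dropped rather than replaced.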

marianamazzochi commented 1 year ago

Brant, I do have the trimmomatic steps already run through phyluce, but it seems that following that pipeline removed all ambiguities. I need a file that contains specific ambiguity codes, like Y and R. Can I get that using the files I obtained from the phyluce pipeline?

brantfaircloth commented 1 year ago

If you follow the standard pipeline (e.g. Tutorial 1), the results output by that approach do not contain variable positions (e.g. Y, R, and the other IUPAC base codes). They are not meant to, because variable positions can cause problems in some phylogenetic analysis programs.

If you need alignments with variable positions (or VCF files with variant bases), then you will need to treat your data in a "custom" way - meaning that you will likely have to move outside of what is supported by (and described for) Phyluce. One way to do that is using an approach similar to what I described above. Another way to do that could be to try the phasing pipeline, but that still may not produce the exact data that you need.

In short, what you need to do is somewhat specific to the goals of your project - and you'll need to decide which method is best to achieve those goals and how to implement that method (or those methods).
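For context on the ambiguity codes mentioned in this thread (Y, R, and the other two-base IUPAC codes), the sketch below shows how a diploid genotype at a single site maps to one character. The `iupac_code` function is a hypothetical helper written for illustration and is not part of phyluce; only the code table itself is standard.

```shell
#!/bin/bash
# iupac_code ALLELE1 ALLELE2
# Print the IUPAC character for a diploid genotype at one site.
iupac_code() {
    local a b
    # Accept lower-case input by converting both alleles to upper case.
    a="$(printf '%s' "$1" | tr 'acgt' 'ACGT')"
    b="$(printf '%s' "$2" | tr 'acgt' 'ACGT')"
    case "$a$b" in
        AA|CC|GG|TT) echo "$a" ;;   # homozygous: the base itself
        AG|GA) echo R ;;            # purines
        CT|TC) echo Y ;;            # pyrimidines
        CG|GC) echo S ;;
        AT|TA) echo W ;;
        GT|TG) echo K ;;
        AC|CA) echo M ;;
        *) echo N ;;                # anything else is treated as unknown
    esac
}

iupac_code A G   # prints R
iupac_code T C   # prints Y
iupac_code C C   # prints C
```

A heterozygous call in a VCF file (genotype 0/1) carries the same information as one of these codes, which is why either representation can serve as a starting point for population-level analyses.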