cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

genotyping with assemblies only #15

Closed: davidaray closed this issue 1 year ago

davidaray commented 1 year ago

Note: This was also sent to Clement via e-mail. Then I realized I should ask through here instead. Sorry for the duplication.

I saw your GraffiTE package a few days ago and just finished installing it for a test run.

The test data ran successfully, so now it's time for a test using our data.

I recently came into possession of two haplotypes for a single individual and thought this might be a useful test case: trying to identify polymorphisms between the two haplotypes of the diploid genome.

According to the documentation on GitHub, all of these are required:

nextflow run cgroza/GraffiTE \
   --assemblies assemblies.csv \
   --TE_library library.fa \
   --reference reference.fa \
   --graph_method pangenie \
   --reads reads.csv

No problem with nearly all of these. But the documentation also says that you can run the pipeline using only assemblies, which is the case I want to try.

From the paper: "pMEs can be detected from genome assemblies or any type of long-read data, and genotyping can be performed using short- and long-read sets. This flexibility allows researchers to get the most out of their data; for example, by performing the initial SV search with high-quality – though perhaps less abundant – data, such as chromosome-level assemblies and long-read sequences, while genotyping in larger cohorts or populations using cost-effective short-read sets."

I haven't tried the run with the two assemblies yet but, given the wording on GitHub, I expect to get an error if I don't include the --reads option.

Is this something I'm going to need to worry about? How do I get around this, if possible?

Just noticed another potential problem:

--graph_method: can be pangenie, giraffe or graphaligner, select which graph method will be used to genotyped TEs. Default is pangenie and it is optimized for short-reads. giraffe can handle both short and long reads, and graphaligner is optimized for long reads.

None of these mentions using only assemblies. Assuming what I'm asking is possible, which of these, if any, should I choose? graphaligner?

David

davidaray commented 1 year ago

There it is! I completely missed this in the first read-through of the documentation:

--genotype: true or false. Use this if you would like to discover polymorphisms in assemblies but you would like to skip genotyping polymorphisms from reads.
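
For the two-haplotype case, an assemblies-only run would presumably look something like the sketch below. It reuses the placeholder file names from the documentation example and assumes that, with --genotype false, the --reads and --graph_method options can simply be left out (I have not confirmed this against the pipeline). The command is only printed here so it can be checked before launching.

```shell
# Sketch of an assemblies-only GraffiTE run (discovery only, no read genotyping).
# Assumption: with --genotype false, --reads and --graph_method can be omitted.
# assemblies.csv, library.fa and reference.fa are placeholder file names.
set -- run cgroza/GraffiTE \
   --assemblies assemblies.csv \
   --TE_library library.fa \
   --reference reference.fa \
   --genotype false

# Dry run: print the full command for review.
# To actually launch the pipeline, run: nextflow "$@"
echo nextflow "$@"
```

Once the printed command looks right, replacing the final echo line with `nextflow "$@"` would start the run.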

Ignore, please.