MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License
43 stars 19 forks source link

BOWTIE2 errors in the pipeline #41

Closed ArchaeologyRoz closed 3 years ago

ArchaeologyRoz commented 3 years ago

Dear Mikkel,

I hope you're well! I'm working on my MPhil dissertation using paleomix to investigate ancient millet DNA. I've been able to trim adapters using the pipeline, but the aligning is causing problems. I have been able to align using both bowtie2 and BWA not through paleomix. This is my current makefile, there seems to be an issue with MinQuality 30. if I put 30 in brackets, the error changes slightly to observed value: true. I'm new to bioinformatics and I'm not sure what I'm doing wrong! Best, Roz.

-- mode: Yaml; --

Default options.

Can also be specific for a set of samples, libraries, and lanes,

by including the "Options" hierarchy at the same level as those

samples, libraries, or lanes below.

Options:

Sequencing platform, see SAM/BAM reference for valid values

Platform: Illumina

Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)

or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to

specify 'Solexa', to handle reads on the Solexa scale. This is

used during adapter-trimming and sequence alignment

QualityOffset: 33

Settings for trimming of reads, see AdapterRemoval man-page

AdapterRemoval:

Set and uncomment to override defaults adapter sequences

 --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

--adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

 # Some BAM pipeline defaults differ from AR defaults;
 # To override, change these value(s):
 --mm: 3
 --minlength: 25
 # Extra features enabled by default; change 'yes' to 'no' to disable
 --collapse: no
 --trimns: yes
 --trimqualities: yes

Target name; all output files uses this name as a prefix

Z2_align:

Sample name; used to tag data for segregation in downstream analyses

Z2_align:

Library name; used to tag data for segregation in downstream analyses

Z2_align:
  # Lane / run names and paths to FASTQ files
  Z2_align: Z2_S5_L001_R1_001.fastq.gz

Settings for mappings performed using Bowtie2

Bowtie2:
  # Filter aligned reads with a mapping quality (Phred) below this value
  MinQuality: 30
  # Filter reads that did not map to the reference sequence
  FilterUnmappedReads: yes
  # Examples of how to add additional command-line options

--trim5: 5

--trim3: 5

  # Note that the colon is required, even if no value is specified
  --very-sensitive:
  # Example of how to specify multiple values for an option

--rg:

- CN:SequencingCenterNameHere

- DS:DescriptionOfReadGroup

Prefixes: millet_genome: Path: ~/align.bowtie2/GCA_002895445.2_ASM289544v2_genomic.fna.gz

Error: ERROR Error reading makefiles: Makefile requirement not met at 'Z2_align :: Z2_align :: Bowtie2 :: MinQuality': Expected value: (a non-empty string) or ({(a non-empty string) : (a non-empty string)}) Observed value: 30 Observed type: int

MikkelSchubert commented 3 years ago

I can't really read the YAML file you copy/pasted (if you need to include files to a github issue, then just include them as attachments), but it looks like you've tried to override options but omitted the Options key. I'm guessing that your YAML file looks something like this:

Z2_align:
  Z2_align:
    Z2_align:
      Bowtie2:
        MinQuality: 30

But what you'd actually want is something like the following, corresponding to the Options section at the top of the file:

Z2_align:
  Z2_align:
    Z2_align:
      Options:
        Aligners:
          Bowtie2:
            MinQuality: 30

I am guessing that you just want to map all your samples with Bowtie2. In that case you're better off just changing appropriate option in the Options section at the top of the YAML file, rather than trying to override options for a specific sample. The Options listed at the top of file apply to all samples, unless specifically overridden, and you generally don't need to set options for specific samples/library/etc. unless your data is heterogeneous.

However, I also noticed that the copy pasted YAML appears to be from the trim pipeline. If you created your YAML file using paleomix trim makefile, then I'd suggest creating a new one using paleomix bam makefile instead. The trim pipeline just runs the first step of the BAM pipeline and is intended for the case where all you want to do is to trim/merge your reads. Because of that the YAML template doesn't include any of the mapping options. If you instead make a YAML file with paleomix bam makefile then it'll also include all the mapping options, including the options that lets you pick Bowtie2 instead of BWA.

MikkelSchubert commented 3 years ago

Hi Roz,

A reference genome in FASTQ format is a bit unusual. Are you sure that you have the right file? If you are sure, then you can convert it using seqtk ( https://github.com/lh3/seqtk):

$ seqtk seq -A -l 79 input.fq.gz > output.fasta

Cheers, Mikkel

Den man. 23. aug. 2021 kl. 16.12 skrev ArchaeologyRoz < @.***>:

Dear Mikkel,

Thank you! You're right I was using the wrong makefile. Unfortunately my reference genome is in fastq format (.fa.gz) not fasta, is there a program you would recommend to convert? I tried sed -n '14s/^@/>/p;24p' in.fastq

out.fasta but the output file doesn't work.

Thanks again,

Roz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MikkelSchubert/paleomix/issues/41#issuecomment-903810755, or unsubscribe https://github.com/notifications/unsubscribe-auth/AASMY27KRFIHIYV6T644RWLT6JJLXANCNFSM5CUNPXTA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

ArchaeologyRoz commented 3 years ago

Dear Mikkel,

No you're right, it's a zipped fasta and not a fastq. I tried gunzip -v and the output file isn't functional still. Do you know of another way? Thank you so much for your help, I really appreciate it! My supervisor is on annual leave and I'm scratching my head a bit.

Best,

Roz

MikkelSchubert commented 3 years ago

Hi Roz,

It's a bit hard to say from where I'm sitting. Why isn't the output file functional? Did it extract correctly or did you get any errors? Is the output a FASTA file or something else? If the genome is something you've downloaded from a public website then I can take a look at it if you share the URL.

Cheers, Mikkel

ArchaeologyRoz commented 3 years ago

Dear Mikkel,

When I use the genome.fna.gz file I receive this error:

Error reading makefiles: Path for prefix 'millet_genome' does not end with .fasta:    '~/align.bowtie2/GCA_002895445.2_ASM289544v2_genomic.fna.gz'

I then tried to unzip the file using gunzip. This gives a .fasta file but the colour of the file has changed indicating it's no longer the same type of file. I tried with this file anyway and got:

Reference FASTA file does not exist: '~/align.bowtie2/GCA_002895445.2_ASM289544v2_genomic.fasta'

It's this millet genome https://www.ncbi.nlm.nih.gov/genome?LinkName=bioproject_genome&from_uid=431363

Thanks! Roz

MikkelSchubert commented 3 years ago

gunzip removes the .gz extension as part of the decompression process, so your FASTA is mostly likely named GCA_002895445.2_ASM289544v2_genomic.fna.gz now (ending with .fna). You'll need to rename it so that it ends with .fasta.

I'm also not sure off the top of my head if the pipeline supports ~/ in paths in the YAML file, so you'll probably need use the full path.

ArchaeologyRoz commented 3 years ago

Hi Mikkel,

Yes my file path was the problem!! Thank you so much. I can't promise I won't be back with more silly questions, but you'll be featuring in my acknowledgments :) If you're particularly interested in my results I can also send them over at the end of the project. I'm in the Archaeology Dept. at Cambridge.

Thanks again, paleomix is really impressive work!

MikkelSchubert commented 3 years ago

Hi Roz,

I'm glad that you got it working and thank you for the acknowledgments. Feel free to open new issues if you run into any other problems.

I also appreciate your offer, but millet aDNA is a bit outside my current areas of interest. But best of luck with your project!

Cheers, Mikkel