benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
468 stars 142 forks source link

Unable to select all files in the folder #1679

Closed nhg12 closed 4 months ago

nhg12 commented 1 year ago

HI

I am trying to run the DADA2 pipeline for PacBio full length data. I am on the step where I define the path to remove primers and I am unable to select all the sequences to run in a single command. It either removes primer from only one sequence or it says that input files are not present.

I am using the following command: fn <- file.path(path, "zymo_CCS_99_9.fastq.gz")

I want to change zymo xxx.fastq.gz to a folder so it picks up all the fastq.gz files in there. How can i do this? did i miss any step?

Thanks!

benjjneb commented 1 year ago

The beginning sections of the DADA2 tutorial should get you where you want to go. Basically you'll leverage list.files to get e.g. all the fastq files in a directory.

nhg12 commented 1 year ago

I dont know what the problem is. I managed to select all the sequences in the folder. However, when I remove the primer this way and check the size the size is very small. Hence I cannot run the next command which is filtering the sequences. I am sharing the histogram of my result here. I am struggling with the tutorial since days :(

Rplot

this is the error I get

hist(nchar(getSequences(nop)), 100) filt <- file.path(path, "noprimers", "filtered", basename(fn)) track <- fastqFilter(nop, filt, minQ=2, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=12, verbose=TRUE) Error in file(con) : invalid 'description' argument In addition: Warning message: In if (fn == fout) { : the condition has length > 1 and only the first element will be used

this does not happen when I remove the primers from a single sequence. I get the correct size of the sequence

benjjneb commented 1 year ago

Is this now a different problem? It's hard to know what is going wrong without being able to see the code you are running. What is your removePrimers command? What are the diagnostics that come out of that (i.e. reads in, reads out)? What primer set are you using for your amplicon sequencing?

nhg12 commented 1 year ago

Hi I had the all sequences problem that got solved with list.files however it is not giving the right size I am following the zymo mock full length 16S tutorial using the universal primers. The reads out is good 90-97% But when i make the histogram it shows the wrong size

Get Outlook for iOShttps://aka.ms/o0ukef


From: Benjamin Callahan @.> Sent: Thursday, January 26, 2023 11:32:49 PM To: benjjneb/dada2 @.> Cc: Noor-Ul-Huda Ghori @.>; Author @.> Subject: Re: [benjjneb/dada2] Unable to select all files in the folder (Issue #1679)

Is this now a different problem? It's hard to know what is going wrong without being able to see the code you are running. What is your removePrimers command? What are the diagnostics that come out of that (i.e. reads in, reads out)? What primer set are you using for your amplicon sequencing?

— Reply to this email directly, view it on GitHubhttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fbenjjneb%2Fdada2%2Fissues%2F1679%23issuecomment-1405188311&data=05%7C01%7C00093057%40uwa.edu.au%7C03cacb3cfcb64a53f8c708daffb2961f%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103439744898359%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Dfm%2FnEmWGkwCtf989JUVE3jbJgVuWs54VGX8pouQn4o%3D&reserved=0, or unsubscribehttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAQL6QZSON6HQKTV7HYWGJLLWUKKKDANCNFSM6AAAAAAUFYIRAU&data=05%7C01%7C00093057%40uwa.edu.au%7C03cacb3cfcb64a53f8c708daffb2961f%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103439744898359%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OZIu28nmtuUtzluShtb2%2Bqzzhpg6n5RDlWIWljX3iOY%3D&reserved=0. You are receiving this because you authored the thread.Message ID: @.***>

benjjneb commented 1 year ago

What is your removePrimers command? What are the diagnostics that come out of that (i.e. reads in, reads out)? What primer set are you using for your amplicon sequencing?

nhg12 commented 1 year ago

HI Ben,

This is the command i used path<-"/Users/hudaghori/Desktop/tki_skin" list.files(path) path.out <- "Figures/" path.rds <- "RDS/" fn <- sort(list.files(path, pattern=".fastq.gz", full.names=TRUE)) list.files(fn) F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" rc <- dada2:::rc theme_set(theme_bw()) nop <- file.path(path, "noprimers", basename(fn)) prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)

a few outputs are

Read in 8781, output 7988 (91%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. 4716 sequences out of 9748 are being reverse-complemented. Read in 9748, output 8890 (91.2%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match. 9252 sequences out of 18827 are being reverse-complemented. Read in 18827, output 17206 (91.4%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match. 8278 sequences out of 16883 are being reverse-complemented. Read in 16883, output 15454 (91.5%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match. 13983 sequences out of 28502 are being reverse-complemented. Read in 28502, output 25802 (90.5%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match. 216 sequences out of 444 are being reverse-complemented. Read in 444, output 412 (92.8%) filtered sequences. Multiple matches to the primer(s) in some sequences. Using the longest possible match. Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.

the primer set Is mentioned above

Thanks for your help

benjjneb commented 1 year ago

That output looks like things are working as expected. The commands are very helpful though, as they point out the issue:

nop <- file.path(path, "noprimers", basename(fn))

nop is a character vector of multiple filepaths. When you run getSequences on a character vector, it just returns that character vector. It is not pulling the sequence data from the files. Thus, the results of hist are just the number of characters in your filepaths, and the frequency is the number of files.

Inspect individual files with nop[[i]] (where i is an integer index). You can also use the plotQualityProfile(nop[[5]]) to see the quality profile of the fifth primer-removed file, including a cumulative length distribution (the red line).

nhg12 commented 1 year ago

Correct. That makes sense, but now i am confused how to run the filteration command. Because it gives me the error i mentioned earlier . Does that error means that the sequences are not the correct size?

Get Outlook for iOShttps://aka.ms/o0ukef


From: Benjamin Callahan @.> Sent: Friday, January 27, 2023 12:23:56 AM To: benjjneb/dada2 @.> Cc: Noor-Ul-Huda Ghori @.>; Author @.> Subject: Re: [benjjneb/dada2] Unable to select all files in the folder (Issue #1679)

That output looks like things are working as expected. The commands are very helpful though, as they point out the issue:

nop <- file.path(path, "noprimers", basename(fn))

nop is a character vector of multiple filepaths. When you run getSequences on a character vector, it just returns that character vector. It is not pulling the sequence data from the files. Thus, the results of hist are just the number of characters in your filepaths, and the frequency is the number of files.

Inspect individual files with nop[[i]] (where i is an integer index). You can also use the plotQualityProfile(nop[[5]]) to see the quality profile of the fifth primer-removed file, including a cumulative length distribution (the red line).

— Reply to this email directly, view it on GitHubhttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fbenjjneb%2Fdada2%2Fissues%2F1679%23issuecomment-1405262993&data=05%7C01%7C00093057%40uwa.edu.au%7Cb20f3408f143433448c208daffb9ba9a%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103470421073885%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OpRkzJeITgECaHrHitZ9Z8zcp7UmEEjCyTrUY3ucVuQ%3D&reserved=0, or unsubscribehttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAQL6QZRLPS6L6C4OBENTIQDWUKQJZANCNFSM6AAAAAAUFYIRAU&data=05%7C01%7C00093057%40uwa.edu.au%7Cb20f3408f143433448c208daffb9ba9a%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103470421073885%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=jHHlkgiyD1QxoYs%2BJZ%2Fhi%2BXMSN84iiGyXmQiZz5YMN4%3D&reserved=0. You are receiving this because you authored the thread.Message ID: @.***>

benjjneb commented 1 year ago

What is your filterAndTrim command? What are the diagnostics that come out of that (i.e. reads in, reads out, error message)?

nhg12 commented 1 year ago

track <- fastqFilter(nop, filt, minQ=2, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=12, verbose=TRUE)

Error in file(con) : invalid 'description' argument In addition: Warning message: In if (fn == fout) { : the condition has length > 1 and only the first element will be used

benjjneb commented 1 year ago

You should be using filterAndTrim, not fastqFilter.

nhg12 commented 1 year ago

Okay should i not follow the full length sequences protocol?

Get Outlook for iOShttps://aka.ms/o0ukef


From: Benjamin Callahan @.> Sent: Friday, January 27, 2023 12:49:39 AM To: benjjneb/dada2 @.> Cc: Noor-Ul-Huda Ghori @.>; Author @.> Subject: Re: [benjjneb/dada2] Unable to select all files in the folder (Issue #1679)

You should be using filterAndTrim, not fastqFilter.

— Reply to this email directly, view it on GitHubhttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fbenjjneb%2Fdada2%2Fissues%2F1679%23issuecomment-1405298356&data=05%7C01%7C00093057%40uwa.edu.au%7C6053c0b710534ad2d89908daffbd5255%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103485846421719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=IP6AWu00zvf4ib48mk8THrvDvlZ3QLiDxSOeoS4GncQ%3D&reserved=0, or unsubscribehttps://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAQL6QZRRROLODSN2NVYIKSTWUKTKHANCNFSM6AAAAAAUFYIRAU&data=05%7C01%7C00093057%40uwa.edu.au%7C6053c0b710534ad2d89908daffbd5255%7C05894af0cb2846d8871674cdb46e2226%7C0%7C0%7C638103485846421719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=jVUOrRp6WqGfKUZcac3a4LUO5gXJD3sdRfVuhe0znDE%3D&reserved=0. You are receiving this because you authored the thread.Message ID: @.***>

benjjneb commented 1 year ago

Should i not follow the full length sequences protocol?

What is the link to the "full length sequences protocol" you are following?

I guess perhaps the Zymo mock community analysis from the LRAS manuscript repository? That is analyzing a single sample. You'll have better luck copy-pasting commands from the analysis of (multiple) fecal samples in that repository: https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html