crisprVerse / screenCounter

Get barcode counts from functional genomics screens
https://bioconductor.org/packages/screenCounter
MIT License

unknown base "N" error. #6

Closed williambakhache closed 9 months ago

williambakhache commented 9 months ago

Hello,

Wonderful tool, it's been working very well with some of my subsetted FASTQ files.

I tried running it on my NGS data. However, I'm getting this error:

Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error: Error in eval(expr, envir, enclos): unknown base 'N'

I think it has to do with some N bases in my sequences. Interestingly, a smaller FASTQ file with similar sequences works.

Let me know of any thoughts on how to fix this. Thanks for developing this tool.

William

LTLA commented 9 months ago

Hm. I thought I fixed this in the last release cycle:

library(screenCounter)

# Creating an example single barcode sequencing experiment.
known.pool <- c("AGAGAGAGA", "CTCTCTCTC",
    "GTGTGTGTG", "CACACACAC")

# Adding some N's to the constant regions of the sequence data.
N <- 1000
barcodes <- sprintf("CAGCTANNCGTACG%sCCAGCTCGANNTCG",
    sample(known.pool, N, replace=TRUE))
names(barcodes) <- seq_len(N)

# Writing the reads to a temporary FASTQ file.
library(Biostrings)
tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(barcodes), filepath=tmp, format="fastq")

# Counting the barcodes.
countSingleBarcodes(tmp, choices=known.pool,
    template="CGTACGNNNNNNNNNCCAGCTC")
## DataFrame with 4 rows and 2 columns
##       choices    counts
##   <character> <integer>
## 1   AGAGAGAGA       270
## 2   CTCTCTCTC       224
## 3   GTGTGTGTG       262
## 4   CACACACAC       244

Make sure you're running the latest version (1.2.0) from Bioconductor.
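For anyone unsure which version they have, here is a minimal sketch using base R only. The helper name `needs_update` is made up for illustration and is not part of screenCounter; the 1.2.0 threshold is the version mentioned above.

```r
# Hypothetical helper (not part of screenCounter):
# TRUE if `installed` is older than `required`.
needs_update <- function(installed, required="1.2.0") {
    package_version(installed) < package_version(required)
}

if (requireNamespace("screenCounter", quietly=TRUE)) {
    current <- as.character(packageVersion("screenCounter"))
    message("screenCounter ", current,
        if (needs_update(current)) " is older than 1.2.0" else " is recent enough")
} else {
    message("screenCounter is not installed")
}

# To update (assumes the BiocManager package is available):
# BiocManager::install("screenCounter")
```

On a shared setup such as a hosted notebook server, the installed version may be tied to the Bioconductor release that the administrators provide, so updating might need their help.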

williambakhache commented 9 months ago

Thank you so much! I'll check whether we have the latest version on our JupyterHub.

One last question: is it possible to extract the read ID for each barcode?

Thank you for developing this.


LTLA commented 9 months ago

> One last question: is it possible to extract the read ID for each barcode?

Currently not, it's all aggregated in the underlying C++ libraries.

I suppose we could report the read names associated with each barcode, but that could use an awful lot of memory for a deeply sequenced experiment. There may or may not be a better way to do what you actually want to do.
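As a rough workaround outside of screenCounter, the read names could be paired with barcodes directly in R. The sketch below is an assumption-laden illustration, not how the package works: `extract_barcodes_by_read` is a hypothetical helper, it requires exact matches to the constant flanks, and it keeps every read in memory, which is exactly the cost mentioned above for a deeply sequenced experiment.

```r
# Hypothetical helper (not part of screenCounter): find the variable region
# between two constant flanks and keep the read name alongside it.
# `reads` is a character vector of sequences named by read ID.
extract_barcodes_by_read <- function(reads, left, right, width) {
    pattern <- sprintf("%s([ACGT]{%d})%s", left, width, right)
    hits <- regmatches(reads, regexec(pattern, reads))
    # A successful match yields the full match plus one captured group.
    found <- lengths(hits) == 2L
    data.frame(
        read = names(reads)[found],
        barcode = unname(vapply(hits[found], `[`, "", 2L)),
        stringsAsFactors = FALSE
    )
}

# Toy reads reusing the flanks from the example earlier in this thread.
reads <- c(
    read1 = "CAGCTANNCGTACGAGAGAGAGACCAGCTCGANNTCG",
    read2 = "CAGCTANNCGTACGCTCTCTCTCCCAGCTCGANNTCG",
    read3 = "CAGCTANNCGTACGNNNNNNNNNCCAGCTCGANNTCG"  # N's in the barcode itself
)
res <- extract_barcodes_by_read(reads,
    left="CGTACG", right="CCAGCTC", width=9)
print(res)
```

Here `read3` is dropped because its variable region contains N's. For real data, the named character vector could come from `as.character(Biostrings::readDNAStringSet(path, format="fastq"))`, where `path` is a placeholder for your own file.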

williambakhache commented 9 months ago

Thanks for your reply. For now I'm just using this package to quality-check my random barcode library.

In the future, I want to link a barcode with a certain genotype in that read.

Best wishes

William



williambakhache commented 9 months ago

Hello,

Just to let you know that this fixed it for me.

Works like a charm.

William

LTLA commented 9 months ago

Ok, great.

As for the other question: when you have more clarity on the nature of the problem, make another issue and we can see what we can do. It may be possible to adapt the C++ code underlying the countCombinatorialBarcodes function so that it captures the combination of genotype with a random barcode (assuming that we're dealing with a simple SNP).
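To make the idea concrete in the meantime, here is a small R-level sketch of pairing a barcode with a single-base genotype in the same read. Everything in it is an assumption for illustration: `link_barcode_to_snp` is a hypothetical helper rather than a screenCounter function, the SNP is assumed to sit at a fixed offset after the right flank, and no mismatches are tolerated in the flanks.

```r
# Hypothetical helper (not part of screenCounter): report the barcode and
# the base found `snp.offset` positions after the right flank of each read.
link_barcode_to_snp <- function(reads, left, right, width, snp.offset) {
    pattern <- sprintf("%s([ACGT]{%d})%s", left, width, right)
    m <- regexec(pattern, reads)
    hits <- regmatches(reads, m)
    found <- lengths(hits) == 2L
    # First position after the full match (left flank + barcode + right flank).
    after <- vapply(m[found],
        function(x) x[1] + attr(x, "match.length")[1], integer(1))
    snp.pos <- after + snp.offset - 1L
    data.frame(
        barcode = unname(vapply(hits[found], `[`, "", 2L)),
        genotype = unname(substr(reads[found], snp.pos, snp.pos)),
        stringsAsFactors = FALSE
    )
}

# Toy reads: the third base after the right flank is the assumed SNP.
reads <- c(
    r1 = "CGTACGAGAGAGAGACCAGCTCGAT",
    r2 = "CGTACGAGAGAGAGACCAGCTCGAC",
    r3 = "CGTACGCTCTCTCTCCCAGCTCGAT"
)
linked <- link_barcode_to_snp(reads,
    left="CGTACG", right="CCAGCTC", width=9, snp.offset=3)

# Cross-tabulate barcodes against genotypes.
print(table(linked$barcode, linked$genotype))
```

This only shows the shape of the output one might want, namely a barcode-by-genotype table; the details would depend on the actual construct.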