cancerit / NanoSeq

Analysis software for Nanorate Sequencing (NanoSeq) experiments
GNU Affero General Public License v3.0

very low duplicate rate #37

Closed LisaHagenau closed 1 year ago

LisaHagenau commented 2 years ago

Hello,

I am having some trouble with our first NanoSeq results, but I am not sure whether this is a wetlab or bioinformatics problem (though I suspect the former). I processed the data as far as necessary to run the efficiency_nanoseq.pl script, which returned an extremely low duplicate rate (0.028). Below are the commands I ran to create these results:

# extract tags
python ~/src/nanoseq/bin/extract-tags.py -a  data/raw/fastq/merged/S352_S1_R1.merged.fq -b data/raw/fastq/merged/S352_S1_R2.merged.fq -c data/processed/S352_extrR1.fastq -d data/processed/S352_extrR2.fastq -m 3 -s 4 -l 151

# map with bwa, add rb and mb tags
bwa mem -t 12 -C /mnt/genomes/hg/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna data/processed/S352_extrR1.fastq data/processed/S352_extrR2.fastq > data/processed/S352_mapped.sam

# Add rc and mc tags, mark optical duplicates, create read bundle tags
bamsormadup inputformat=sam rcsupport=1 threads=12 < data/processed/S352_mapped.sam > data/processed/S352_mapped_od.bam
bammarkduplicatesopt optminpixeldif=2500 threads=12 < data/processed/S352_mapped_od.bam > data/processed/S352_mapped_mdo.bam
bamaddreadbundles -I data/processed/S352_mapped_mdo.bam -O data/processed/S352_filtered.bam

# run randomreadinbundle to produce a deduplicated (neat) bam
randomreadinbundle -I data/processed/S352_filtered.bam -O data/processed/S352_neat.bam

# index bam files
samtools index data/processed/S352_neat.bam
samtools index data/processed/S352_filtered.bam

# run efficiency_nanoseq.pl
efficiency_nanoseq.pl -d data/processed/S352_neat.bam -x data/processed/S352_filtered.bam  -r /mnt/genomes/hg/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -o S352

# cat S352.tsv
# Whole-genome metrics:
NUM_UNIQUE_READS    302800159
NUM_SEQUENCED_READS 311577682
DUPLICATE_RATE  0.0281712186304794
# RB metrics are reported for chr/contig chr1 only:
TOTAL_RBS   11799643
TOTAL_READS_IN_RBS  24611888
OK_RBS(2+2) 16452
READS_PER_RB    1.042908
F-EFF   0.008890473
EFFICIENCY  0.00111409575730232
GC_BOTH 0.3772532
GC_SINGLE   0.3769166

What doesn't quite make sense to me is that in the duplex (filtered) bam file, 244 million reads are marked as duplicates (out of 311 million, using samtools flagstat), while the deduplicated (neat) bam file contains 302 million reads and no marked duplicates. Wouldn't that mean that most duplicate reads are in different read bundles?
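As a quick sanity check on the numbers above, the two headline metrics can be recomputed from the raw counts in S352.tsv. This is only a sketch of relationships that happen to reproduce the reported values; the factor of 2 in the reads-per-bundle calculation is inferred from the numbers (it may reflect paired-end reads) and is not taken from the efficiency_nanoseq.pl source. The function names are made up for illustration.

```python
# Counts copied from the S352.tsv output in this post.
NUM_UNIQUE_READS = 302_800_159
NUM_SEQUENCED_READS = 311_577_682
TOTAL_RBS = 11_799_643
TOTAL_READS_IN_RBS = 24_611_888


def duplicate_rate(unique_reads: int, sequenced_reads: int) -> float:
    """Fraction of sequenced reads that are not unique."""
    return 1 - unique_reads / sequenced_reads


def reads_per_rb(total_reads: int, total_rbs: int) -> float:
    """Reads per read bundle; dividing by 2 is an assumption inferred from the data."""
    return total_reads / (2 * total_rbs)


print(round(duplicate_rate(NUM_UNIQUE_READS, NUM_SEQUENCED_READS), 4))  # 0.0282
print(round(reads_per_rb(TOTAL_READS_IN_RBS, TOTAL_RBS), 4))            # 1.0429

# Coordinate-based marking (samtools flagstat): roughly 244 of 311 million reads.
print(round(244 / 311, 2))                                              # 0.78
```

So the coordinate-based duplicate fraction (about 0.78) and the barcode-based one (about 0.03) are measuring different things for the same data.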

If the bioinformatic analysis is correct, then I suspect that something went wrong with the library quantification which resulted in a massive underestimation of amplifiable fragments. We used a different kit for the qPCR than described in the methods (NEBNext Library Quant), but checked that the primers that come with the kit are the same as in the KAPA kit and added the NanoqPCR primers to a final concentration of approx 330 nM. We did observe ~10x lower library yields than described in the paper even with high DNA input from freshly prepared HMW DNA from HaCaT cells (see plot). For the sequenced sample (fibroblasts) we tried 3 dilutions (1:50, 1:500 and 1:5000), but only the 1:5000 sample showed a normal amplification curve which we used for calculating the fmol input.

lib-yield

Any help would be appreciated.

Thanks, Lisa

fa8sanger commented 2 years ago

Hi Lisa,

I am sorry it didn’t work for you. Don’t rely on traditional duplicate metrics: with enzymatic digestion, most fragments share the same mapping coordinates. You have to rely on the barcodes to group reads into duplicate families (read bundles).
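To illustrate the point about coordinates versus barcodes, here is a toy sketch with four made-up reads that all map to the same coordinates. It is not how bammarkduplicatesopt or bamaddreadbundles are implemented; it only shows why grouping by coordinates alone overstates duplication when many distinct molecules share the same start and end.

```python
from collections import Counter

# Hypothetical reads: (chrom, start, end, barcode1, barcode2).
reads = [
    ("chr1", 11038, 11177, "TTT", "CTT"),
    ("chr1", 11038, 11177, "TTT", "CTT"),  # PCR copy of the first read
    ("chr1", 11038, 11177, "ACG", "GGA"),  # same coordinates, different molecule
    ("chr1", 11038, 11177, "CAT", "TGC"),  # same coordinates, different molecule
]


def duplicate_fraction(keys) -> float:
    """Fraction of reads that are extra copies within their group."""
    counts = Counter(keys)
    total = sum(counts.values())
    return (total - len(counts)) / total


# Group by mapping coordinates only (traditional duplicate marking).
by_coordinates = duplicate_fraction([read[:3] for read in reads])
# Group by coordinates plus both barcodes (read-bundle style grouping).
by_bundle = duplicate_fraction(reads)

print(by_coordinates)  # 0.75
print(by_bundle)       # 0.25
```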

How many fmols did you take into the amplification? It seems you sequenced your library to 30x; for that we recommended 0.6 fmol (at present we are recommending 0.4 fmol). However, there are probably slight differences between different quantification methods. Those differences, however, shouldn’t change duplicate rates from 75-80% to 2.8%. If you have high concentrations of DNA it would be advisable to make dilutions before taking fmols.

Your yields are much lower, I am not sure why. Perhaps they are not the true yields and that’s what made things go wrong?

I hope we can find out what went wrong.

Best wishes, Fede

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

LisaHagenau commented 2 years ago

Hi Fede,

thank you for the quick answer. We were aiming for 0.3 fmol input, but ended up using approx 0.2 fmol and 15 PCR cycles so that we could sequence on a NextSeq MidOutput flow cell (which usually generates 130 million clusters). From how I understood the protocol, 15x equals 150 million read pairs, which to me means 300 million reads total. But even if we oversequenced, wouldn't that mean that we get fewer read bundles with higher coverage and so a higher duplicate rate?
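The "15x equals 150 million read pairs" figure can be checked with a rough coverage estimate. This sketch assumes 151 bp reads (taken from the -l 151 option above), a genome size of about 3.1 Gb, and ignores the bases trimmed by extract-tags.py as well as unmapped reads; all of these are simplifications for illustration.

```python
def mean_coverage(read_pairs: float, read_length: int, genome_size: float) -> float:
    """Approximate mean coverage for paired-end data (two reads per pair)."""
    return read_pairs * 2 * read_length / genome_size


GENOME_SIZE = 3.1e9  # assumed approximate human genome size in bp

print(round(mean_coverage(150e6, 151, GENOME_SIZE), 1))  # 14.6
print(round(mean_coverage(130e6, 151, GENOME_SIZE), 1))  # 12.7
```

Under these assumptions, 150 million read pairs correspond to roughly 15x, and 130 million clusters to a little under 13x.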

> If you have high concentrations of DNA it would be advisable to make dilutions before taking fmols.

We quantified the library at 3 different dilutions (prepared serially), but unfortunately, only the 1:5000 dilution amplified properly (within standard curve range), from which we calculated a concentration of 0.017 nM.
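For reference, converting a qPCR concentration into a pipetting volume is a one-line calculation, because 1 nM equals 1 fmol/µL. The sketch below assumes that 0.017 nM is the concentration of the solution actually being pipetted; the thread does not say whether this value refers to the stock or to the 1:5000 dilution, so treat the result as illustrative only.

```python
def volume_for_fmol(target_fmol: float, concentration_nm: float) -> float:
    """Volume in µL that contains target_fmol, given a concentration in nM.

    Uses the identity 1 nM = 1 nmol/L = 1 fmol/µL.
    """
    if concentration_nm <= 0:
        raise ValueError("concentration must be positive")
    return target_fmol / concentration_nm


# 0.2 fmol from a 0.017 nM solution (assumed to be the pipetted solution).
print(round(volume_for_fmol(0.2, 0.017), 2))  # 11.76
```

If the true concentration is several-fold higher than the measured one, the same volume carries several-fold more fmol into the PCR, which is the failure mode discussed in this thread.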

> Your yields are much lower, I am not sure why. Perhaps they are not the true yields and that’s what made things go wrong?

Yes, I think so too. From the results, particularly the RB metrics, it seems likely to me that there is actually a lot more library than we quantified and so we used too much input and too many PCR cycles leading to too many read bundles with not enough reads.

I am thinking of running the qPCR again with both primer pairs on the final library (after the PCR). Theoretically, all fragments should be amplified equally by both primer pairs, correct? If the results are too disparate, then at least we know where the problem is.

Since the issue is probably wetlab-based, should I move the discussion to the protocol exchange site or is it okay to continue here?

Best, Lisa

fa8sanger commented 2 years ago

Hi Lisa,

I don’t know much about the wet-lab side but I can put you in contact with Stef, the expert here.

Before sequencing so much next time, you could take far fewer fmols and do shallow sequencing (MiSeq?). The ratio of sequenced reads to fmol is linear. That would help you calibrate things on your side.
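A minimal sketch of that linear relationship, using the reference points mentioned earlier in the thread (0.6 fmol, now 0.4 fmol, for 30x). The function name and the idea of scaling from a single reference point are illustrative assumptions, not part of the NanoSeq protocol or software.

```python
def fmol_for_coverage(target_coverage: float,
                      ref_fmol: float = 0.4,
                      ref_coverage: float = 30.0) -> float:
    """Scale library input linearly with the planned sequencing depth."""
    return ref_fmol * target_coverage / ref_coverage


# Current recommendation from the thread: 0.4 fmol for 30x.
print(round(fmol_for_coverage(15), 2))                # 0.2
# Earlier recommendation: 0.6 fmol for 30x.
print(round(fmol_for_coverage(15, ref_fmol=0.6), 2))  # 0.3
```

The same proportionality applies to a shallow pilot run: a smaller input sequenced proportionally less deeply should give a similar duplicate rate, which is what makes it usable for calibration.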

The number of PCR cycles shouldn’t matter that much, I think it’s just that you picked way more than 0.2 fmol. Looking at those duplicate rates, even much more than 2 fmols.

Best, Fede

LisaHagenau commented 2 years ago

Hi Fede,

thank you. I will run some PCRs to try to get to the bottom of this issue. I would appreciate any input from the wetlab expert.

Best, Lisa

fa8sanger commented 2 years ago

No problem, can you send me your email address and I will put you in contact with Stef? My email is @.**@.>

fa8sanger commented 1 year ago

Please, also check the RB tag is properly formatted: chr:position:barcode1:barcode2, just in case there was any bioinformatic problem upstream

LisaHagenau commented 1 year ago

> Please, also check the RB tag is properly formatted: chr:position:barcode1:barcode2, just in case there was any bioinformatic problem upstream

Thank you, I just checked and the RB tag is formatted like this in the bam file:

RB:Z:chr1,11038,11177,TTT,CTT
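To check tags like this across a file, a small parser can flag malformed entries. The sketch below assumes the comma-separated, five-field layout seen in the tag above; the field names (in particular reading the two numbers as read and mate positions) are guesses for illustration, and the thread does not confirm which layout the downstream NanoSeq tools expect.

```python
from typing import NamedTuple


class ReadBundleTag(NamedTuple):
    chrom: str
    pos: int        # assumed: position of the read
    mate_pos: int   # assumed: position of the mate
    barcode1: str
    barcode2: str


def parse_rb(field: str) -> ReadBundleTag:
    """Parse an 'RB:Z:chrom,pos,mate_pos,bc1,bc2' field; raise ValueError if malformed."""
    prefix = "RB:Z:"
    if not field.startswith(prefix):
        raise ValueError(f"not an RB tag: {field!r}")
    parts = field[len(prefix):].split(",")
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated fields, got {len(parts)}")
    chrom, pos, mate_pos, barcode1, barcode2 = parts
    return ReadBundleTag(chrom, int(pos), int(mate_pos), barcode1, barcode2)


tag = parse_rb("RB:Z:chr1,11038,11177,TTT,CTT")
print(tag.chrom, tag.pos, tag.mate_pos, tag.barcode1, tag.barcode2)
```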

So this could be an additional problem? I did not close the issue yet as we are still working on the quantification issue. We are getting much higher library yields with our new qPCR setup using a synthesized standard specific to the Nanoseq adapters. I hope we can confirm this by sequencing within the next two weeks.

Thanks, Lisa

LisaHagenau commented 1 year ago

Hello,

I think we have our quantification method sorted and we want to start a new sequencing run soon. But we noticed a high molecular weight smear when we ran a Bioanalyzer assay after the second PCR. I think this indicates overamplification. The library with the low duplicate rate also shows massive overamplification (which makes sense). Is this a common observation and are these libraries okay to sequence in your opinion? I'd really appreciate your feedback.

Thanks, Lisa

221118_bioanalyzer_B1-G1

fa8sanger commented 1 year ago

Hi Lisa,

Not sure about that. Sometimes higher insert sizes indicate ligation between fragments, later resulting in lower proportions of properly paired reads. But I don’t think I have seen such large fragments, and I doubt they are PCR products. It could also be that there are free adapters and you are seeing PCR recombination.

In a case like this we would just sequence and see. If your budget is tight, I am not sure what to advise.

About your quantification method, how did you validate it? By sequencing and estimating duplicate rates/complexity of the library? If you obtained sequencing data for this, it could be valuable for understanding whether high molecular weight DNA is a problem or not.

Best, Fede

LisaHagenau commented 1 year ago

So it turns out that it was overamplification. We skipped a dilution step during library preparation and again used way too much input DNA for the 2nd PCR, resulting in a low duplicate rate. We ordered a new flowcell and can hopefully run it next week. Third time's the charm...

LisaHagenau commented 1 year ago

Good news, the quantification and dilution worked and we finally have some promising results (though not quite optimal yet). We applied four different correction factors before PCR amplification:

Metric               0.75x       1x          1.5x        2x
NUM_UNIQUE_READS     17780921    19063518    26977953    28716908
NUM_SEQUENCED_READS  77286712    61072936    69592529    62001104
DUPLICATE_RATE       0.7699356   0.68785653  0.61234412  0.53683231
TOTAL_RBS            747958      794587      1119105     1186732
TOTAL_READS_IN_RBS   6239178     4959980     5672574     5057478
OK_RBS(2+2)          150345      116937      115603      74749
READS_PER_RB         4.170808    3.121106    2.534424    2.130843
F-EFF                0.3855591   0.3263629   0.2582081   0.2441442
EFFICIENCY           0.04016154  0.03929351  0.03396547  0.02463316
GC_BOTH              0.4029726   0.3995528   0.3952528   0.3991534
GC_SINGLE            0.4096244   0.4024825   0.4000627   0.4054076

Based on these results, I would apply a correction factor of 0.6x or so for the next library prep. The strand drop-out fraction is a bit high, but the DNA we used has been in storage for a while, so this was kind of expected.
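For anyone comparing their own titration, here is a small sketch that recomputes the duplicate rate for each correction factor from the read counts in the results above and checks which libraries fall in the 75-80% range mentioned earlier in this thread. The data structure and the range check are illustrative; the snippet does not predict what an untested factor such as 0.6x would give.

```python
# Read counts copied from the four-library results in this comment.
libraries = {
    "0.75x": {"unique": 17_780_921, "sequenced": 77_286_712},
    "1x":    {"unique": 19_063_518, "sequenced": 61_072_936},
    "1.5x":  {"unique": 26_977_953, "sequenced": 69_592_529},
    "2x":    {"unique": 28_716_908, "sequenced": 62_001_104},
}

# Duplicate-rate range quoted earlier in the thread.
TARGET_LOW, TARGET_HIGH = 0.75, 0.80

rates = {
    name: 1 - counts["unique"] / counts["sequenced"]
    for name, counts in libraries.items()
}
in_target = [name for name, rate in rates.items() if TARGET_LOW <= rate <= TARGET_HIGH]

for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
print(in_target)  # ['0.75x']
```

Note that the four libraries were sequenced to different depths, so the duplicate rates are not strictly comparable without normalizing for read count.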

For future reference, the quantification method that we use is based on a synthetic standard that contains the NanoSeq adapter sequences. We essentially chose a sequence from the ERCC (ERCC-00171), removed the poly-A tail and added the NanoSeq adapter sequences to the ends (link). We ordered it as a gBlock from IDT. We then used it as standard for the qPCR quantification of the NanoSeq libraries. The library yields were close to the ones reported in the paper.

Thank you for your help!

Lisa