Lexogen-Tools / idemux

Idemux is a command line tool designed to demultiplex paired-end fastq files from QuantSeq-Pool.
https://www.lexogen.com/

High number of undetermined reads #6

Closed jaanckae closed 9 months ago

jaanckae commented 2 years ago

Using idemux results in a high number of reads being assigned to the undetermined_R1 and undetermined_R2 fastq files. Is this expected behaviour or are additional settings necessary to avoid this?

Used commands are:

idemux --r1 00_bcl2fastqout/Undetermined_S0_R1_001.fastq.gz --r2 00_bcl2fastqout/Undetermined_S0_R2_001.fastq.gz --sample-sheet sample_sheet.csv --out 01_idemuxout/ --i5-rc

with a sample sheet that looks like this

sample_name,i7,i5,i1
B,CGGGAACCCGCA,AGGGCCAAAGAC,CGGGAACCCGCA
C,CGGGAACCCGCA,AGGGCCAAAGAC,AAACGTTCATCC
D,CGGGAACCCGCA,AGGGCCAAAGAC,TTGTCCGATATG
...

FalkoHof commented 2 years ago

Hey @jaanckae, I have pasted some shell commands below to count the observed barcodes in your undetermined reads file. Could you please run these commands on the undetermined reads output produced by idemux and post the results here?

# this will generate a text file with the observed i7,i5 combinations
zcat undetermined_R1.fastq.gz | paste - - - - | awk '{if(map[$2]){map[$2]+=1}else{map[$2]=1}}END{for(m in map){print m,map[m]}}' | sort -k2,2nr > barcode_combinations.counts
# here we do a bit of reformatting and sorting to get the sorted i5 counts
cat barcode_combinations.counts | awk '{A=substr($1,length($1)-11,12); if(map[A]){map[A]+=$2}else{map[A]=$2}}END{for(m in map){print m,map[m]}}' | sort -k2,2nr > barcode_i5.counts
# and here the same for the i7
cat barcode_combinations.counts | awk '{A=substr($1,7,12);if(map[A]){map[A]+=$2}else{map[A]=$2}}END{for(m in map){print m,map[m]}}' | sort -k2,2nr > barcode_i7.counts

jaanckae commented 2 years ago

Hi Falko

The files created by the commands you provided are attached.

barcode_i5_counts.txt barcode_i7_counts.txt

FalkoHof commented 2 years ago

Dear @jaanckae, thanks! Could you please also upload barcode_combinations.counts? Also, just to clarify again: were these commands run on the undetermined reads output produced by idemux? I am asking because the most abundant barcodes seem to be the ones you defined in the sample sheet.

jaanckae commented 2 years ago

Hi Falko

The commands have indeed been run on the output produced by idemux. The barcode_combinations file can be found here

FalkoHof commented 2 years ago

Hey @jaanckae, thanks for the files!

Looking at the top 3 barcode combinations I see:

  1. CGGGAACCCGCA+GTCTTTGGCCCT (74499017 counts)
  2. AAACGTTCATCC+TTAGTAACTGGG (51282225 counts)
  3. GGGGGGGGGGGG+AGATCTCGGTGG (16059858 counts)

Now 1 and 2 seem to be valid Lexogen i7/i5 barcode combinations. 1 is the combination you specified above in the sample sheet, while 2 is a different combination. This suggests to me that, when these reads end up in the undetermined reads bin, something is going wrong with the i1, as otherwise they should have been demultiplexed correctly. 3 seems to be a mix of absence of signal (the polyG) and a non-Lexogen barcode.
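For reference, the i5 observed in combination 1 (GTCTTTGGCCCT) is the reverse complement of the i5 listed in the sample sheet (AGGGCCAAAGAC), which is what one would expect when idemux is run with --i5-rc. A minimal sketch to check this by hand, assuming a POSIX-like shell with awk and tr available (the revcomp helper name is made up for illustration):

```shell
#!/usr/bin/env bash
# Reverse-complement a DNA barcode: reverse the string with awk,
# then swap A<->T and C<->G with tr.
# NOTE: revcomp is a hypothetical helper, not part of idemux.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# i5 from the sample sheet; the result is the i5 seen in the read headers
revcomp AGGGCCAAAGAC   # prints GTCTTTGGCCCT
```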

Could you also please post the full sample sheet you are feeding into idemux and elaborate a bit on what has been multiplexed in this run?

If you contact us at support@lexogen.com we can also arrange for you to upload your fastq files to one of our servers. I can then have a look at them in more detail, if that would be alright with you.

jaanckae commented 2 years ago

Hi Falko

This is the sample sheet

sample_sheet.csv

Our data consists of single muscle fiber RNA-seq data with the following setup:

48 samples, prepped in 2 pools of 24

Indexing

I'll have to check if uploading fastq files to your server(s) is possible from our side.

FalkoHof commented 2 years ago

Thanks for the sample sheet. I just checked it and everything seems fine.

In order for me to debug this, though, I would need to have a look at the fastq files. Otherwise I can't really tell what's going on. As a company we are ofc GDPR compliant, so please clarify if you could share that data with us.

As mentioned before, I suspect that the issue is related to the i1 index, as the top 1 combination is a valid i7/i5 combination that you also specified in your sample sheet. Now, the scripts I sent you don't count the i1 frequency, so I need to update them so you can also count the most prominent i7/i5/i1 combinations.
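As a stopgap, the i7/i5/i1 combinations could be counted along these lines. This is only a sketch under stated assumptions: it expects bcl2fastq-style headers (the i7+i5 pair after the last colon of the header comment) and assumes the i1 is 12 nt long and starts at 1-based position 11 of read 2. That position is an assumption here and should be checked against your library layout. The function name and the output file name are hypothetical.

```shell
#!/usr/bin/env bash
# Count i7+i5+i1 combinations from a read 2 fastq stream on stdin.
# ASSUMPTION: i1 starts at 1-based position 11 of read 2 and is 12 nt long;
# pass other values as arguments if your layout differs.
count_i7_i5_i1() {
    local i1_start="${1:-11}" i1_len="${2:-12}"
    paste - - - - |
        awk -v s="$i1_start" -v l="$i1_len" '{
            n = split($2, parts, ":")              # header comment, e.g. 2:N:0:i7+i5
            combo = parts[n] "+" substr($3, s, l)  # append the i1 cut from the read
            counts[combo]++
        } END { for (c in counts) print c, counts[c] }' |
        sort -k2,2nr
}

# usage (hypothetical output file name):
# zcat undetermined_R2.fastq.gz | count_i7_i5_i1 > barcode_i7_i5_i1.counts
```

Splitting on the last colon rather than using a fixed offset keeps the sketch independent of the read number at the start of the header comment.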

My current hypotheses are (also because you seem to lose only ~20% of your total reads):

  1. More combinations of an i1 with CGGGAACCCGCA+GTCTTTGGCCCT and AAACGTTCATCC+TTAGTAACTGGG than are specified in the sample sheet are present in your mix.
  2. Some samples received a different i1 than expected.
  3. Somehow the required i1 correction map does not get loaded by idemux, and it does not perform error correction.

Could you maybe also post your demultiplexing statistics and the logs that idemux prints to the console at the beginning of the demultiplexing (just the printout where it tells you which barcode sets are loaded for error correction)? Are there some samples that received no, or nearly no, reads but should have received some?

jaanckae commented 2 years ago

Hi Falko

Finally got the all-clear to upload some of our fastq files to your servers. Should I contact support@lexogen.com for this?

atuerk commented 2 years ago

Hi @jaanckae,

my colleague @FalkoHof is on vacation today. I will therefore set up an account for you on our sftp server. We usually send username and password via different channels. Can you therefore please send me an email address to which I can send the password?

Best, Andreas

jaanckae commented 2 years ago

You can contact me at jasper.anckaert@ugent.be

jaanckae commented 2 years ago

Here are the demultiplexing statistics and a part of the log:

demultipexing_stats.txt

2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i7...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i5_rc...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i1...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Peeking into fastq files to check for barcode formatting errors
2022-01-03 10:06:27 INFO     Checking fastq input files...
2022-01-03 10:06:27 INFO     Input file formatting seems fine.
2022-01-03 10:06:27 INFO     Starting demultiplexing

There is, as far as I can see, no mention of a file that is used for error correction. All samples received reads.

atuerk commented 2 years ago

Dear Jasper,

I created your account on our sftp server, which can be accessed at the following address.

sftp://185.106.248.148 port: 22 username: jasper.anckaert

I will send you the password separately via email in the next 5 minutes. Please check your junk mail folder if you cannot find my mail.

For accessing the server we recommend using filezilla. Please upload your data into the upload folder.

Cheers, Andreas

jaanckae commented 2 years ago

Hi Andreas

Thanks for the support. Should I upload the fastq files before or after idemux has been run? I was just wondering if there is a maximum combined file size. Our raw fastq (before idemux) files are more than 50G in total.

Best Jasper

atuerk commented 2 years ago

Hi Jasper,

50GB is not a problem. However, @FalkoHof should tell you what he needs to analyse your problem.

Cheers, Andreas

FalkoHof commented 2 years ago

Hi Jasper, please upload the files before demultiplexing, along with the sample sheet you used. I can then run the tool and use the output to investigate what's going on. Best wishes, Falko

jaanckae commented 2 years ago

Hi Falko

Files have been uploaded.

Best Jasper

FalkoHof commented 2 years ago

Thanks Jasper! I will have a look and let you know what I find. Best, Falko

jaanckae commented 2 years ago

Hi Falko

Any updates on the issue?

Thanks in advance Jasper

FalkoHof commented 2 years ago

Dear Jasper, I ran your data through idemux yesterday and am currently investigating the issue. If you want, I can send you the demultiplexing stats and you can check if these results match up with yours.

The next step for me would now be to check what exactly is left in the undetermined reads files. I am currently quite busy, so this might take me a few days. However, I will let you know as soon as I have an update.

Best, Falko

jaanckae commented 2 years ago

That would be great, thanks.

FalkoHof commented 2 years ago

Hey Jasper, just to let you know: I am currently implementing a new feature that writes the observed barcode combinations in the undetermined reads file, and their respective counts, to a tsv file. I am currently testing this with your data and will be releasing it soon (by the end of the week). Once I am satisfied with the output I will send it to you and give you some detailed feedback on what is in the undetermined reads file and what could have happened there. Best wishes, Falko

FalkoHof commented 2 years ago

Dear Jasper, see below for a quick overview of the barcode distribution in your undetermined reads file. I have also copied some files with more diagnostic info over to the sftp account we made for you. You can find these in the folder 'barcode_stats'.

summary statistics

valid_i7 valid_i5 valid_i1 read_counts
TRUE TRUE FALSE 133758239
TRUE FALSE TRUE 24505734
FALSE FALSE FALSE 22121539
FALSE TRUE TRUE 13507433
TRUE TRUE TRUE 7532011
TRUE FALSE FALSE 5457087
FALSE TRUE FALSE 3438461
FALSE FALSE TRUE 862013
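
To put these counts in proportion, the share of each category can be computed with a short awk sketch. It assumes the table above is saved as whitespace-separated text including its header line; the file name barcode_summary.txt and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Print each valid_i7/valid_i5/valid_i1 category with its read count and its
# share of all undetermined reads. Expects the table on stdin, header included.
category_shares() {
    awk 'NR > 1 && NF == 4 {
             rows++; key[rows] = $1 " " $2 " " $3; n[rows] = $4; total += $4
         }
         END {
             for (i = 1; i <= rows; i++)
                 printf "%s %d %.1f%%\n", key[i], n[i], 100 * n[i] / total
         }'
}

# usage (hypothetical file name):
# category_shares < barcode_summary.txt
```

With the numbers above, the TRUE TRUE FALSE category (valid i7 and i5, invalid i1) accounts for about 63% of the 211182517 undetermined reads.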

the different combinations

the sequences

discussion

Best wishes, Falko

jaanckae commented 2 years ago

Hi Falko

Can I just come back to the issue with the barcode correction? How can I check/be sure that the file for error correction is loaded?

Thanks again Jasper

FalkoHof commented 2 years ago

Hey Jasper, idemux prints some logs to the console, as can be seen in the output you posted above:

2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i7...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i5_rc...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Trying to find the appropriate barcode set for i1...
2022-01-03 10:06:27 INFO     Correct set found. Used set is 96 barcodes with 12 nt length.
2022-01-03 10:06:27 INFO     Peeking into fastq files to check for barcode formatting errors
2022-01-03 10:06:27 INFO     Checking fastq input files...
2022-01-03 10:06:27 INFO     Input file formatting seems fine.
2022-01-03 10:06:27 INFO     Starting demultiplexing

These tell you which of the barcode correction maps have been loaded. The appropriate sets can be found here in these folders.

The current build/commit on the master branch is failing because I was trying to incorporate some updates, which I have not yet managed to fully roll out. But if you check e.g. the commit history, the tests before May 3, 2022, which are associated with the 0.1.6 release, all passed. We ofc don't test everything via automated testing, but if you have some suggestions on what has been missed or should be implemented I am happy to give this a try.

I have listed the relevant parts where we test if the correct barcode maps are loaded and if the demultiplexing works as expected:

To be fair, in the test where I check if the barcode maps can be loaded, I only test that the i7 map can be loaded and returns the correct values, not the i5 and i1 maps. However, the second link contains the tests on synthetic data, which can only succeed if the correct barcode sets are loaded as well.

Alternatively, how about we set up a call and I quickly walk you through the relevant parts of the code? As mentioned before, the main algorithm is not very complicated and the code is quite simple.

Which version are you using btw? The 0.1.6 version?

Best, Falko