aquaskyline / LRSIM

10x Genomics Reads Simulator
MIT License

Using LRSIM with LongRanger: Extremely high rate of incorrect barcodes observed (99.90 %) #34

Closed morispi closed 3 years ago

morispi commented 3 years ago

Hello,

I recently found out about LRSIM, which seems to be super useful for gaining a better understanding of SV tools.

However, I'm trying to generate a toy dataset, and then align it to a reference with LongRanger, and LongRanger always stops and reports "stage error:Extremely high rate of incorrect barcodes observed (99.90 %). Check that input is 10x Chromium data, and that there are no missing cycles in the first 16bp of Read 1. Please note Long Ranger 2.0 and above do not support GemCode data.".
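Since the error message points at the first 16 bp of Read 1, one way to look into it is to measure how many reads start with a known barcode. The following is only a minimal sketch: `barcode_match_rate` is a hypothetical helper, the whitelist is assumed to be a plain set of 16-base strings loaded from whatever barcode list the pipeline uses, and the input is assumed to be uncompressed 4-line FASTQ records.

```python
from itertools import islice


def barcode_match_rate(r1_lines, whitelist, barcode_len=16):
    """Return the fraction of reads whose first `barcode_len` bases are in `whitelist`.

    `r1_lines` is any iterable of FASTQ lines for Read 1 (hypothetical input).
    """
    total = matched = 0
    # FASTQ records span 4 lines; the sequence is the 2nd line of each record.
    for seq in islice(r1_lines, 1, None, 4):
        total += 1
        if seq.strip()[:barcode_len] in whitelist:
            matched += 1
    return matched / total if total else 0.0


# Toy data: the first read starts with a whitelisted barcode, the second does not.
demo = [
    "@r1", "ACGTACGTACGTACGT" + "TTTT", "+", "I" * 20,
    "@r2", "GGGGGGGGGGGGGGGG" + "AAAA", "+", "I" * 20,
]
rate = barcode_match_rate(demo, {"ACGTACGTACGTACGT"})
```

A very low rate on Read 1 would suggest the barcodes themselves are the problem; a high rate would point toward something else, such as the read headers.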

I did read from another issue that the "/1" and "/2" have to be removed from the end of the headers in order for LongRanger to work with LRSIM data, but removing them did not seem to help.

Here is the command line I'm using for generating the data: perl simulateLinkedReads.pl -r References/Ecoli.fasta -p Ecoli/SimEcoli -n -x 100 -o

I used a lower -x value because I don't need a lot of reads for now. Can it be the cause of the issue? Leaving it to the default 600 seems to generate too many reads for the toy tests I want to perform, hence why I lowered it. I also used the -o option as advised in another issue I found after a bit of searching.

Is there anything I'm doing wrong, or could you advise me on how to properly use LongRanger with LRSIM data?

Thanks in advance.

Best, Pierre

aquaskyline commented 3 years ago

Lowering -x to 100 is likely the cause of the error. For a smaller dataset, you might try simulating a larger dataset with LRSIM's default parameters and then using a subset of the simulated reads, randomly sampled from the whole fastq file. LongRanger very stringently checks the distribution and depth evenness of the barcodes used. Changing the default parameters in LRSIM and using -o to disable parameter checking can make the simulated dataset look unreliable to LongRanger.

morispi commented 3 years ago

Thanks for your answer!

Yeah, I could also do that, but the reason I lowered -x was actually because the size of the data generated was getting pretty big. I'm not exactly sure how far I got through the simulation process, but it grew up to a little more than 500 GB. Since I don't have access to lots of disk space, I thought lowering -x was a good compromise.

I'll try running again and leave -x at its default value then. Do you have any idea how much disk space it is gonna use in total, when running on E. coli? I would just like to be sure it's not gonna fully fill the available disk space I have left, since I have other experiments running in parallel that also require a little disk space.

Thanks again.

Pierre

aquaskyline commented 3 years ago

You could try running test.sh in the test folder; it provides an example on E. coli. The E. coli reference is already in the folder, so all you need to do is run the test.sh script.

morispi commented 3 years ago

I did run a full experiment with default parameters on E. coli last night. It ran successfully in a few hours and needed around 700 GB of disk space to run. However, I did not use the parameters specified in test.sh because I did not want any SV to be included in the data (it might sound weird, but I'm interested in seeing how SV-calling tools, especially the one I'm working on, behave on datasets with no SVs). The command I used was the following: perl simulateLinkedReads.pl -r Ecoli.fasta -p /scratch/pmorisse/LRSIM/Ecoli/SimEcoli -n

I then used seqtk to randomly subsample the fastq files, and performed LongRanger alignment with the subsampled fastq files I thus generated. The total size of the fastq files was around 7 GB, which seems like a reasonable coverage for a small test experiment.

However, I still got the same error, and LongRanger reported that an extremely high rate of incorrect barcodes was observed.
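When subsampling paired files, the two mates of each pair have to be kept together. As an illustration only (this is not what seqtk does internally, and `read_records` and `sample_pairs` are hypothetical names), a reservoir-sampling sketch over uncompressed 4-line FASTQ records could look like this:

```python
import random
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as tuples of 4 lines; a trailing partial record is dropped."""
    it = iter(lines)
    while True:
        rec = tuple(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec


def sample_pairs(r1_lines, r2_lines, k, seed=11):
    """Reservoir-sample k read pairs, always keeping R1 and R2 of a pair together."""
    rng = random.Random(seed)
    reservoir = []
    pairs = zip(read_records(r1_lines), read_records(r2_lines))
    for i, pair in enumerate(pairs):
        if i < k:
            reservoir.append(pair)
        else:
            # Replace a stored pair with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = pair
    return reservoir


# Toy data: 10 pairs whose R1 and R2 headers are identical.
r1 = [line for i in range(10) for line in (f"@read{i}", "ACGT", "+", "IIII")]
r2 = [line for i in range(10) for line in (f"@read{i}", "TGCA", "+", "IIII")]
picked = sample_pairs(r1, r2, k=3)
```

Sampling pairs rather than individual reads guarantees that both output files list the same reads in the same order.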

Am I forced to perform LongRanger alignment with the whole 700 GB fastq file generated with LRSIM? I'm afraid I won't have enough disk space if I have to do so. Or might it be because I deactivated SV simulation?

aquaskyline commented 3 years ago

I suggest you run test.sh first to see whether its output goes through LongRanger.

-- Laurent

morispi commented 3 years ago

I just ran test.sh and provided the generated data to LongRanger. It crashed again, and output a different error message:

Log message: stage error:FASTQ parsing error: input fastq not consistent

aquaskyline commented 3 years ago

It ran well on my side. I uploaded the files generated at http://www.bio8.cs.hku.hk/lrsim/.

morispi commented 3 years ago

Just downloaded and tested with your data, and got the same error. Might be something to do with LongRanger I guess? Can you tell me which version you are using?

aquaskyline commented 3 years ago

LRSIM was tested on LongRanger 2.0

-- Laurent

morispi commented 3 years ago

That might be why: I'm using LongRanger 2.2.2. LongRanger 2.0 does not seem to be available for download on the 10x Genomics website though.

morispi commented 3 years ago

I managed to pin down the problem.

As mentioned in a previous issue, this was caused by the "/1" and "/2" located at the end of the headers of the reads simulated by LRSIM, which seem to be incompatible with LongRanger. Removing them and rerunning LongRanger seemed to fix the problem with the data generated by the test.sh script.

I also tried generating more data, using most of the parameters mentioned in test.sh, but deactivating SV simulation, and all seems to work well. LongRanger is still running, but did not report any error.

I believe my initial problem with the high rate of incorrect barcodes was due to the fact that I was using -x 1 without decreasing the -t parameter accordingly.
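For reference, removing the suffixes can be sketched as follows. This is a minimal illustration, assuming uncompressed 4-line FASTQ records; `strip_pair_suffix` is a hypothetical helper, not part of LRSIM or LongRanger.

```python
def strip_pair_suffix(lines):
    """Yield FASTQ lines with a trailing /1 or /2 removed from header lines only."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Only the 1st line of each 4-line record is a header; quality strings
        # may legitimately end in "/1" or "/2" and must be left untouched.
        if i % 4 == 0 and line.endswith(("/1", "/2")):
            line = line[:-2]
        yield line


# Toy record whose quality string also happens to end in "/1".
record = ["@read0/1\n", "ACGT\n", "+\n", "II/1\n"]
cleaned = list(strip_pair_suffix(record))
```

For gzipped files, the same function can be fed from `gzip.open(path, "rt")`, writing each yielded line followed by a newline to the output file.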
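To make the relationship between the two options concrete, here is a rough sketch of the arithmetic. It rests on assumptions that are not confirmed in this thread: that -x counts millions of read pairs, that -t counts thousands of partitions with a default of 1500, and that keeping the default number of read pairs per partition is a sensible target. Only the -x default of 600 comes from the discussion above, and both function names are hypothetical.

```python
def pairs_per_partition(x_million_pairs, t_thousand_partitions):
    """Average read pairs per partition, under the assumed units for -x and -t."""
    return (x_million_pairs * 1_000_000) / (t_thousand_partitions * 1_000)


def scaled_t(x_million_pairs, default_x=600, default_t=1500):
    """Suggest a -t value that keeps the default pairs-per-partition ratio.

    default_t=1500 is an assumption; check the LRSIM usage message before relying on it.
    """
    return max(1, round(default_t * x_million_pairs / default_x))


default_ratio = pairs_per_partition(600, 1500)   # ratio with the assumed defaults
lowered_ratio = pairs_per_partition(100, 1500)   # -x lowered, -t left unchanged
suggested_t = scaled_t(100)                      # -t scaled down with -x
```

Under these assumptions, lowering -x alone spreads far fewer read pairs over the same number of partitions, whereas scaling -t by the same factor keeps the per-partition depth at its default level.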

aquaskyline commented 3 years ago

That's great. I was trying to pinpoint the problem but focused too much on the barcode list.

-- Laurent

morispi commented 3 years ago

Closing since the problem is solved.