demis001 / temposeqcount


incomplete result tables are generated #26

Open Agpa88 opened 5 years ago

Agpa88 commented 5 years ago

Dear Temposeqcount Creator,

I am interested in using the Temposeqcount application, but I encountered a problem while executing it. I have 64 samples (fastq.gz files) to process, and whenever I run the script, .fastqLog.final.txt files are not generated for all of the fastq.gz files (they are missing from /result/tempFolder). Consequently, the resultDATA_alignment_summary.csv file contains statistics only for those fastq.gz files for which .fastqLog.final.txt files were generated. Moreover, resultDATA_COUNT_countcombined.csv contains zeros for all samples with missing .fastqLog.final.txt files (see the attachments).

Could it be that the format of the input files is incorrect? I am also attaching a screenshot of the manifest file so its format can be verified.

The strangest thing is that whenever I run the script, I get results for only 1-2 samples, and they are different each time.

Thanks, Agnieszka

(three images attached)

jshousephd commented 5 years ago

Can you please try running it with fewer workers? The amount of RAM allocated to each worker is sometimes insufficient for a given STAR alignment, which can produce the outcome you experienced. For example, if you have 64 fastq files being aligned and you give 128 GB of RAM to 40 workers, the 128/40 (about 3.2 GB per worker) may be enough RAM for some alignments and not others.
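The arithmetic behind this suggestion can be sketched as follows. This is only an illustration: it assumes RAM is split evenly across workers, and the function names are hypothetical, not part of temposeqcount.

```python
def ram_per_worker_gb(total_ram_gb: float, workers: int) -> float:
    """RAM available to each worker, assuming an even split (an assumption)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return total_ram_gb / workers


def max_workers(total_ram_gb: float, needed_per_alignment_gb: float) -> int:
    """Largest worker count that still leaves each alignment the RAM it needs."""
    return max(1, int(total_ram_gb // needed_per_alignment_gb))


# The example from the comment above: 128 GB shared by 40 workers.
print(ram_per_worker_gb(128, 40))

# If one alignment needed about 4 GB (a made-up figure), this many workers fit.
print(max_workers(128, 4))
```

Lowering the worker count raises the per-worker share, which is why running with fewer workers can let every alignment finish.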

demis001 commented 5 years ago

@Agpa88 In the meantime, would you please attach the *.fastqLog.out file for one of the failed samples? You will find this file under outdir/result/tempFolder.
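To find which samples failed, one option is to compare the input files against the logs in tempFolder. The sketch below is a guess at the naming scheme based on the file names mentioned in this thread (for example plate5_D10.fastqLog.out): it assumes each input `NAME.fastq.gz` should produce `NAME.fastqLog.final.out`. The suffix is a parameter because the thread also refers to these files as .fastqLog.final.txt. The sample `plate5_D11` in the demo is hypothetical.

```python
import tempfile
from pathlib import Path


def samples_missing_final_log(fastq_dir, temp_dir, log_suffix="Log.final.out"):
    """Return the fastq.gz file names that have no final log in temp_dir.

    Assumes the log prefix is the input name without ".gz" (an assumption).
    """
    missing = []
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        prefix = fq.name[: -len(".gz")]  # e.g. "plate5_D10.fastq"
        if not (Path(temp_dir) / f"{prefix}{log_suffix}").exists():
            missing.append(fq.name)
    return missing


# Small self-contained demo using throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    fastq_dir = Path(tmp, "fastq")
    temp_dir = Path(tmp, "outdir", "result", "tempFolder")
    fastq_dir.mkdir()
    temp_dir.mkdir(parents=True)
    for name in ("plate5_D10.fastq.gz", "plate5_D11.fastq.gz"):
        (fastq_dir / name).touch()
    # Only the second sample has a final log, so the first counts as failed.
    (temp_dir / "plate5_D11.fastqLog.final.out").touch()
    demo_missing = samples_missing_final_log(fastq_dir, temp_dir)
    print(demo_missing)
```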

@demis001

Agpa88 commented 5 years ago

@jshousephd Thanks! I will definitely try it

@demis001 yes sure, here it is: plate5_D10.fastqLog.out.txt Thanks!

demis001 commented 5 years ago

@Agpa88

Here is the error:

EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
Possible cause 1: not enough RAM. Check if you have enough RAM 2782013395 bytes
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 2782013395

I think @jshousephd's recommendation will resolve your issue.

You don't have enough memory to process multiple tasks at the same time. The best way is to reduce the number of CPUs you assign.
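As a rough way to turn the error message above into a worker count, the sketch below pulls the byte figure out of the log text and divides the available memory by it. The function names are hypothetical, and the figure STAR prints is only what it was trying to allocate when it failed, so the result is best read as an upper bound on concurrency rather than a safe setting.

```python
import re

# The error text quoted above, copied from the failed sample's log.
STAR_LOG_TAIL = """\
EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
Possible cause 1: not enough RAM. Check if you have enough RAM 2782013395 bytes
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 2782013395
"""


def required_ram_bytes(log_text):
    """Extract the byte count from the 'not enough RAM' line, or None if absent."""
    match = re.search(r"Check if you have enough RAM (\d+) bytes", log_text)
    return int(match.group(1)) if match else None


def concurrent_alignments(available_bytes, required_bytes):
    """How many allocations of required_bytes fit into available_bytes."""
    return available_bytes // required_bytes


needed = required_ram_bytes(STAR_LOG_TAIL)
print(needed, "bytes, about", round(needed / 2**30, 2), "GiB")

# Hypothetical machine with 8 GiB free.
print(concurrent_alignments(8 * 2**30, needed))
```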

Best, Dereje

demis001 commented 5 years ago

@Agpa88

Reduce the number you pass to -c CPUNUM, --cpuNum CPUNUM. Something like -c 3; I don't know how much you used. The default is 4.

Agpa88 commented 5 years ago

Dear @demis001 and @jshousephd, The solution you suggested worked! I had to pass 2 (3 was still too much) as the CPUNUM argument for the --cpuNum option. I have a virtual Linux machine on a company server, so I guess I must have very little memory allocated to it. Nevertheless, you were extremely helpful, and as a newbie to GitHub and Linux, I am astonished at how this community works. Thanks a lot once again!! Best, Agnieszka

Agpa88 commented 5 years ago

Dear Dereje,

Unfortunately, I have encountered another instance of the “incomplete result tables are generated” problem, so I am reviving the issue. I am now trying to use the Temposeqcount tool for the BioSpyder Whole Transcriptome assay. This means that I would like to process slightly bigger FASTQ files (up to 250MB, so around 5x bigger than for the S1500 assay) and feed the script a manifest file of 22.000 sequences (so around 7 times more than for the S1500 assay, which usually consists of up to 3.000 detection oligo sequences).

What I observe is:

  1. The program stops prematurely, without entering stage 9 and without giving any error (see the screenshot) terminal_screenshot

  2. The file called “resultDATA_alignment_summary.csv” is not generated

  3. The file called “resultDATA_COUNT_countcombined.csv” contains zeros for some samples (attachment) resultDATA_COUNT_countcombined_incomplete.xlsx

  4. Sample-specific .log.out files have incomplete logs (attachment) Example_of_unsuccessful_sample.fastqLog.txt

Since I now work on an Azure Linux virtual machine with top parameters (128GB RAM, 64 virtual CPUs), I don’t think it’s a memory or resource allocation problem.

To get better insight into what is happening, I ran several combinations of input files:

  1. 2x Small (up to 50MB) FASTQ files + small manifest (3.000 sequences) – working
  2. 2x Big (up to 250MB) FASTQ files + small manifest (3.000 sequences) – working
  3. 2x Small (up to 50MB) FASTQ files + big manifest (22.000 sequences) – NOT working
  4. 2x Big (up to 250MB) FASTQ files + big manifest (20.000 sequences) – NOT working
  5. 2x Big (up to 250MB) FASTQ files + first 2.0000 sequences from the big manifest – working

So, it seems to me that the bottleneck is the increased number of sequences in the manifest file that have to be processed. Do you have any ideas on how to solve it? Or maybe Temposeqcount was designed for S1500 assays and cannot cope with the whole transcriptome assay?
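For anyone repeating the last test in the list above (a run with only the first part of the big manifest), a truncated manifest can be produced with a few lines of Python. This is a sketch under assumptions: it treats the manifest as a text file with one sequence per line and one header row, which may not match the real manifest layout, and the function name is hypothetical.

```python
from itertools import islice


def head_of_manifest(lines, n_sequences, has_header=True):
    """Keep the header (if any) plus the first n_sequences data lines."""
    it = iter(lines)
    kept = []
    if has_header:
        kept.extend(islice(it, 1))  # header row, assumed to be a single line
    kept.extend(islice(it, n_sequences))
    return kept


# Tiny in-memory example with made-up probe names and sequences.
rows = ["name,sequence\n"] + [f"probe{i},ACGT\n" for i in range(5)]
print(head_of_manifest(rows, 2))
```

An open file object can be passed as `lines`, and the returned list can be written out with `writelines` to create the smaller manifest.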

I would again appreciate your help very much. Best, Agnieszka

demis001 commented 5 years ago

@Agpa88

Would you please do this and paste the result for the failed samples?

cd outdir/result/tempFolder

The outdir is the one you named as your output directory.

Then, do:

ll in the terminal and paste the output.

@demis001

Agpa88 commented 5 years ago

Here it is: image Agnieszka

demis001 commented 5 years ago

Would you please send me *.fastqLog.* for a single sample?

Dereje

On Wed, May 8, 2019 at 1:39 PM Agpa88 notifications@github.com wrote:

Here it is: [image: image] https://user-images.githubusercontent.com/30736099/57395657-f76ca080-71c8-11e9-887d-d128d4364ad3.png Agnieszka


Agpa88 commented 5 years ago

Isn't it the one I attached in the previous message? Or do you mean another one? Agnieszka

demis001 commented 5 years ago

@Agpa88,

I don't see any problem in the log file you sent. If you don't mind, are you able to share with me, through Google Drive, the manifest csv file and the fastq file that failed? I will run them on my system to troubleshoot. Do you see any error in *.fastqLog.progress.out?

Best, Dereje

On Thu, May 9, 2019 at 3:29 AM Agpa88 notifications@github.com wrote:

Isn't the one I attached up there in the previous message? Or you mean another one? Agnieszka


Agpa88 commented 5 years ago

Dear Dereje,

To me it seems that the endings of the logs for successful and failing samples are different (see the attachments).

Under the following WeTransfer link there is everything you need to try it on your side: link

Thanks a lot, Agnieszka

demis001 commented 5 years ago

Hi Agnieszka,

I spent a few hours tracking down the error. Here is what I found:

It looks like the scaling factor for the STAR genome index didn't work for this library. Here is a quick fix. I will resolve this in a future update.

Deactivate the virtual env:

deactivate

Edit this file:

vim temposeqcount/tasks.py

Look for this line. This is a genome index scaling factor based on library size; for some reason 9 didn't work.

'--genomeSAindexNbases', str(scale_factor),

Change it to:

'--genomeSAindexNbases', str(8),

Save the file, then update the installation:

make install

Then:

source tempseqcount/bin/activate

Run your library; this will work.
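For background on what this parameter controls: the STAR manual says that for small genomes `--genomeSAindexNbases` must be scaled down, and gives min(14, log2(GenomeLength)/2 - 1) as a typical value. The sketch below only evaluates that rule of thumb. It is not how temposeqcount computes `scale_factor` (the thread does not show that code), and the 50-base probe length used in the example is an assumption made for illustration.

```python
import math


def star_typical_sa_index_nbases(genome_length: int) -> float:
    """Rule of thumb from the STAR manual: min(14, log2(GenomeLength)/2 - 1)."""
    if genome_length < 1:
        raise ValueError("genome_length must be positive")
    return min(14.0, math.log2(genome_length) / 2 - 1)


# A 1 megabase reference, the example size the STAR manual itself uses.
print(star_typical_sa_index_nbases(1_000_000))

# Hypothetical reference built from 22,000 manifest sequences of 50 bases each.
# The 50-base length is an assumption, not something stated in this thread.
print(star_typical_sa_index_nbases(22_000 * 50))
```

The manual presents this as a typical value only. In this thread 9 failed and 8 worked, so the formula is best treated as a starting point to adjust from.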

I will send you the counts if you send me your actual email; I don't want to share the actual data here.

Let me know...

Agpa88 commented 5 years ago

Dear Dereje,

It worked! It's truly awesome! I have fought with that for several weeks.

Don't bother sending me counts. Anyway I have plenty more samples to process.

So far I have successfully completed a run for 2 big FASTQ files + the whole transcriptome manifest file. I will let you know how it works with the whole data set (24 big FASTQ files) soon.

For now I would like to thank you very, very much. I greatly appreciate that you spent so much time on troubleshooting.

Words of thanks once again, Best wishes, Agnieszka

Agpa88 commented 5 years ago

Dear Dereje,

Little update - it also worked for the whole data set of 24 FASTQ files + manifest with 22.000 genes. Thanks! Agnieszka

demis001 commented 5 years ago

Thank you for the update, and don't forget to reference our paper in the future.

Best regards, Dereje

On Fri, May 10, 2019 at 8:25 AM Agpa88 notifications@github.com wrote:

Dear Dereje,

Little update - it also worked whole data set of 24 FASTQ files + manifest with 22.000 genes. Thanks! Agnieszka
