Closed SFonsecaCosta closed 4 years ago
Hi Sara,
Thank you for contacting me.
Yes, it should work with 300 libraries, but I'm afraid it will be extremely slow. I'm currently working on a way of calculating the likelihood such that the number of libraries doesn't slow it down so much. I find that ten libraries is more than enough to get robust estimates. Maybe you can subsample 10-30 libraries and see if the results are robust?
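For example, subsampling could look something like this (just a sketch; it assumes your expression matrix has genes as rows and libraries as columns, which may differ from your setup):
set.seed(1)                                   # for a reproducible subsample
sub_idx <- sample(ncol(expression_data), 10)  # pick 10 libraries at random
expression_sub <- expression_data[, sub_idx]  # keep only those libraries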
It looks like the error is from the automatic threshold-finding algorithm. A quick solution is to manually set your prior thresholds for the component means by doing an empirical Bayesian analysis. See the section "Load data into a zigzag object and set hyperparameters" in the tutorial here: https://ammonthompson.github.io/zigzag/
For example, as a preliminary setting I usually specify:
zigzag$new(expression_data, gene_length_data, num_active_components = 2, threshold_a = c(0, 5), threshold_i = 0)
Checking the sensitivity of the component mean posterior to these thresholds is essential.
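A simple way to check that sensitivity is to fit the model under a couple of different threshold settings and compare the component-mean posteriors. A sketch (the threshold values here are only examples):
fit_a <- zigzag$new(expression_data, gene_length_data, num_active_components = 2, threshold_a = c(0, 5), threshold_i = 0)
fit_b <- zigzag$new(expression_data, gene_length_data, num_active_components = 2, threshold_a = c(1, 4), threshold_i = 0)
# run both fits and compare the posterior distributions of the component means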
Does this help?
Hi @ammonthompson
Thanks for your answer and availability.
I would like to combine multiple libraries to be more confident, but I can also see that this would be really slow.
But I followed your suggestion and subsampled the data (randomly choosing 10 and 30 libraries from the ~300). As you can see from the plots, the results differ, so subsampling does not seem to be the best approach. For the protein-coding biotype, for example, the median is clearly affected by the number of libraries. The percentage of protein-coding genes detected as present at a posterior probability of 95% is, as expected, also affected:
subsample of 10 = 26.73258% and subsample of 30 = 14.4004%
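(Roughly, this percentage is the fraction of protein-coding genes with posterior presence probability of at least 0.95; as a sketch, with p_present and is_protein_coding as placeholder names rather than actual zigzag output objects:)
# fraction of protein-coding genes called present at posterior probability >= 0.95
100 * mean(p_present[is_protein_coding] >= 0.95)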
Note: I used sample_frequency = 50, with 5000 generations for the burn-in and ngen = 10000 for the MCMC.
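Concretely, the burn-in call with these settings was roughly the following (a sketch reconstructed from the parameters above and the burnin() call I show further below):
burnin <- my_zigzag$burnin(sample_frequency = 50, ngen = 5000, progress_plot = FALSE, write_to_files = TRUE)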
Regarding importing the 300 libraries into the zigzag object:
After plotting the distributions of all samples, I initially used:
my_zigzag = zigzag$new(data, gene_length, output_directory, num_active_components = 2, threshold_a = c(1, 4))
without specifying threshold_i. I was able to solve the initial error by setting threshold_i = 0 as you suggested, thanks.
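For reference, the adjusted call was roughly this (a sketch based on the parameters mentioned above):
my_zigzag = zigzag$new(data, gene_length, output_directory, num_active_components = 2, threshold_a = c(1, 4), threshold_i = 0)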
Best regards
Sara
It could be a number of things. It may be subsampling error; 10-30 libraries may be too few. If you compare multiple subsamples of 30, you can see whether there is less error between them; if not, subsample more libraries. Or it could be MCMC error; you may need to run the chain longer. What are the effective sample sizes of your posterior sample in the log files? I would be happy to look over your log files and the plots in your mcmc_report directory to help diagnose your MCMC runs. You can email me at ammonthompson@gmail.com
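For example, one way to check effective sample sizes is with the coda package (a sketch; the log-file name and its tab-delimited format here are placeholders, not necessarily what zigzag writes):
library(coda)
samples <- read.table("model_parameters.log", header = TRUE, sep = "\t")        # hypothetical log file
effectiveSize(as.mcmc(as.matrix(samples[, sapply(samples, is.numeric)])))       # ESS per numeric column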
Yes, I agree that it could be multiple things, but I don't think this is a subsampling error. Comparing multiple subsamples of 30 could be a good indicative test (I will try that). I really think the chain should be longer (so I guess a huge number of libraries requires longer MCMC chains, which on the other hand will also increase the zigzag runtime). For the last run the effective sample size is 200 (I will share the files with you via email). Thanks
Hi, I've tested zigzag out on some very large library datasets (> 300) and was able to run it fine. Is it possible one of the libraries in your dataset has an unexpected character? Let me know if this is no longer an issue for you. Thank you.
Hi @ammonthompson ,
Sorry for my delay. Regarding the subsampling, as I told you, I can still detect some differences depending on the subsamples selected. In the dataset that I sent you (30 samples) I agree that the libraries are quite variable among themselves (this is also one of my interests in using zigzag, to see how this can be managed). The number of zeros ranges from 12000 to 28000 for libraries 1 to 26, and from 31000 to 43000 for libraries 27 to 30. In this case we know this can be related to low coverage, so running a posterior predictive simulation should not be necessary, at least for the library with 43k zeros, do you agree? But I can discuss this with you via email.
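(I tabulated the zeros with something like the following; this sketch assumes the count matrix has libraries as columns and only numeric count columns:)
zero_counts <- colSums(data == 0)   # number of genes with zero counts in each library
sort(zero_counts)                   # libraries ordered from fewest to most zeros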
About the big dataset: yes, I was able to load the 331 libraries into the zigzag object. This is the estimated load time:
    user   system  elapsed
4645.925  256.731 4797.792
I found that when you load a large number of libraries you also need to be careful with the initial parameters provided for the next steps (which makes sense, but can also increase the run time), for example in the burnin step:
burnin <- my_zigzag$burnin(sample_frequency = 50, progress_plot = FALSE, write_to_files = TRUE, ngen = 20000)
Error in if (log(runif(1)) < R & R != Inf)
Did this happen to you in your tests?
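(For context, an if() error of this kind in R typically means the condition evaluated to NA, e.g. because the acceptance ratio became NaN. A minimal illustration in plain R, not zigzag code, and only a guess at the cause:)
R <- NaN                              # e.g. an acceptance ratio that evaluated to NaN
if (log(runif(1)) < R & R != Inf) {   # the condition is NA, so if() errors with
  message("accept")                   # "missing value where TRUE/FALSE needed"
}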
Thanks
Hi Sara, I haven't seen that issue during burnin. I think your guess is a good one, but I would need to work with your data to be sure. Happy to discuss this with you over email. If it is ok with you, I will close this issue since the problem appears to be data-specific.
Hi @ammonthompson
Yes, sure, and I think it would be better if we discuss zigzag performance and data specificity over email.
Hello @ammonthompson
I'm trying to load around 300 libraries in a zigzag object and I got this error:
Error in if (sum(sorted_peaks_byd2y < 0) > 0) { : missing value where TRUE/FALSE needed
I think the error is related to the huge number of libraries that I'm trying to import (please correct me if I'm wrong).
I'm starting to wonder whether there is some limit on the number of libraries that can be imported into the zigzag object. A priori I assumed there was not, but now I'm not so sure, since all the tests I did before (with low numbers of libraries) worked quite well, but with a huge number of libraries there seems to be some issue.
I hope you can help me to clarify this error, please.
Best regards
Sara