abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License

CNV mode bug #18

Closed: sabrina0701 closed this issue 9 months ago

sabrina0701 commented 9 months ago

Hi team,

I have run scATOMIC with and without CNV mode. It works well without CNV mode, but an error occurs with CNV mode. Do you know why this happens? Thanks!

Please find the details below:

    results_Kildisiute <- create_summary_matrix(prediction_list = cell_predictions,
                                                use_CNVs = T,
                                                modify_results = T,
                                                mc.cores = 10,
                                                raw_counts = sparse_matrix,
                                                min_prop = 0.5,
                                                known_cancer_type = "Neuroblastoma")

    [1] "step1: read and filter data ..."
    [1] "56255 genes, 35779 cells in raw data"
    [1] "11384 genes past LOW.DR filtering"
    [1] "step 2: annotations gene coordinates ..."
    [1] "start annotation ..."
    [1] "step 3: smoothing data with dlm ..."
    [1] "step 4: measuring baselines ..."
    [1] "6531 known normal cells found in dataset"
    [1] "run with known normal..."
    [1] "baseline is from known input"

    Error in norm.mat.relat[which(DR2 >= UP.DR), ] : subscript out of bounds
    Calls: create_summary_matrix ->
    In addition: Warning messages:
    1: In asMethod(object) : sparse->dense coercion: allocating vector of size 15.0 GiB
    2: In asMethod(object) : sparse->dense coercion: allocating vector of size 15.0 GiB
    3: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.9 GiB
    4: In parallel::mclapply(1:ncol(norm.mat), dlm.sm, mc.cores = n.cores) : scheduled cores 1, 6 did not deliver results, all values of the jobs will be affected
    5: In matrix(unlist(test.mc), ncol = ncol(norm.mat), byrow = FALSE) : data length [260423360] is not a sub-multiple or multiple of the number of rows [7898]
    Execution halted

sabrina0701 commented 9 months ago

One more question: after running CNV mode, why are some cells with aneuploid CNV not predicted as tumour cells? I thought cells with aneuploid CNV should be tumour cells. Many thanks!

inofechm commented 9 months ago

Hi Sabrina, Thanks for your interest in our work! Regarding your first question, it looks like a bug is occurring in copyKAT, specifically with some of your cores not delivering results. I would try running it again with more memory and fewer cores, or even just a single core. I also noticed you are running on a sample of ~36,000 cells. Is this one patient or multiple? For copyKAT and scATOMIC it is important to split your dataset into per-sample count matrices and run the workflow on each sample separately. Unfortunately I cannot debug this further, as it is happening in the copyKAT backend, so I would consider opening an issue in their GitHub repo, although it seems they are not actively maintaining it or responding to issues...
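As a rough illustration of the per-sample split recommended above, here is a minimal R sketch. It assumes the raw counts are a sparse genes-by-cells matrix and that a per-cell vector of sample labels is available; the toy matrix and the `sample_id` vector are made up for the example. The scATOMIC calls are left as comments and reuse only the arguments quoted earlier in this thread, with `mc.cores` lowered to 1 as suggested.

```r
library(Matrix)

set.seed(1)
# Toy sparse count matrix: 5 genes x 6 cells (stand-in for the real raw counts).
counts <- Matrix(rpois(30, lambda = 1), nrow = 5, ncol = 6, sparse = TRUE,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6)))

# Hypothetical per-cell sample labels, in the same order as the columns of counts.
sample_id <- c("S1", "S1", "S2", "S2", "S2", "S3")

# Build one count matrix per sample, keeping all genes.
cell_idx_by_sample <- split(seq_len(ncol(counts)), sample_id)
per_sample_counts <- lapply(cell_idx_by_sample,
                            function(idx) counts[, idx, drop = FALSE])

# Number of cells per sample.
print(sapply(per_sample_counts, ncol))

# Each sample would then go through the workflow on its own (not run here;
# the call shapes are an assumption based on the call quoted in this thread):
# preds   <- run_scATOMIC(per_sample_counts[["S1"]])
# results <- create_summary_matrix(prediction_list = preds, use_CNVs = TRUE,
#                                  modify_results = TRUE, mc.cores = 1,
#                                  raw_counts = per_sample_counts[["S1"]],
#                                  min_prop = 0.5,
#                                  known_cancer_type = "Neuroblastoma")
```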

Regarding your second question, the CNV mode does not modify scATOMIC predictions; it only adds another level of results indicating whether each cell is aneuploid. In this case, scATOMIC's gene-expression-based prediction does not agree with copyKAT's CNV prediction for these cells. I recommend projecting the cells on a UMAP and deciding manually which is more correct. If you see that the cancer cells are in a different cluster than the normal tissue cells, I would usually think that one is cancer and one is normal. If they are in the same cluster, I would go with the CNV prediction instead...
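Before going to a UMAP, it can help to count how many cells actually disagree between the two calls. The sketch below does this on a toy table. Only `scATOMIC_pred` is a column name mentioned in this thread; the `cnv_call` column, its values, and the example labels are hypothetical and should be replaced with whatever the real output contains.

```r
# Toy summary table. Column names other than scATOMIC_pred are hypothetical.
res <- data.frame(
  cell = paste0("cell", 1:8),
  scATOMIC_pred = c("Neuroblastoma", "Neuroblastoma", "T cell", "Normal Tissue Cell",
                    "Normal Tissue Cell", "Neuroblastoma", "T cell", "Normal Tissue Cell"),
  cnv_call = c("aneuploid", "aneuploid", "diploid", "aneuploid",
               "diploid", "diploid", "diploid", "aneuploid"),
  stringsAsFactors = FALSE
)

# Expression-based call versus CNV-based call, as logical vectors.
pred_is_cancer   <- res$scATOMIC_pred == "Neuroblastoma"
cnv_is_aneuploid <- res$cnv_call == "aneuploid"

# 2 x 2 agreement table between the two calls.
agreement <- table(expression_says_cancer = pred_is_cancer,
                   cnv_says_aneuploid = cnv_is_aneuploid)
print(agreement)

# Cells where the two calls disagree; these are the ones to inspect on a UMAP.
discordant <- res[pred_is_cancer != cnv_is_aneuploid, ]
print(discordant$cell)
```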

Let me know if this helps and I'll close the issue.

sabrina0701 commented 9 months ago

Hi Ido,

I really appreciate your reply.

On the first question: for some cohorts I have already run scATOMIC on the whole cohort rather than on separate samples. Will this affect the prediction results? In addition, some samples contain more than 45,000 cells, which caused the run to fail, so I split each such sample into two matrices. Will this affect the results?

For the CNV mode, you mean that it will not change the scATOMIC predictions and only adds one CNV column to the final results, am I right? But I ran the same dataset with and without CNV mode, and the scATOMIC predictions are different. Sorry for the silly question; I have to ask because I want to extract tumor cells according to your scATOMIC package.

Many thanks.

Best wishes, Sabrina

inofechm commented 9 months ago

Hi Sabrina,

> For some cohorts I have already run scATOMIC on the whole cohort rather than on separate samples. Will this affect the prediction results?

Yes, this will make the results less accurate for malignant and normal tissue cells. It shouldn't have much of an effect on blood or stromal cells. I recommend running scATOMIC only on each sample separately.

> Some samples contain more than 45,000 cells, which caused the run to fail, so I split the sample into two matrices. Will this affect the results?

Splitting the matrix may have a slight effect, but not much as long as the split is relatively random in terms of the cell types in each split (which you would obviously not know beforehand...). One thing you could do to help with this is cluster the entire dataset with Seurat and, when you split, ensure that 50% of each cluster contributes to each split. If possible, I would recommend using a high-performance cluster when you have so many cells per sample, to avoid splitting the data.
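The cluster-balanced split described above can be sketched in base R as follows. This is only an illustration under assumptions: the `cluster` vector stands in for cluster labels obtained from a prior clustering (for example with Seurat), and the cell names are invented.

```r
set.seed(42)

# Hypothetical cluster labels for 20 cells, named by cell barcode.
cluster <- rep(c("c1", "c2", "c3"), times = c(10, 6, 4))
names(cluster) <- paste0("cell", seq_along(cluster))

# Within each cluster, randomly send half of the cells to split A.
cells_by_cluster <- split(names(cluster), cluster)
split_a <- unlist(lapply(cells_by_cluster, function(cells) {
  sample(cells, size = floor(length(cells) / 2))
}), use.names = FALSE)

# Every remaining cell goes to split B.
split_b <- setdiff(names(cluster), split_a)

# Each cluster now contributes about 50% of its cells to each split.
print(table(cluster[split_a]))
print(table(cluster[split_b]))

# The two count matrices would then be counts[, split_a] and counts[, split_b].
```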

> But I ran the same dataset with and without CNV mode, and the scATOMIC predictions are different.

Are the results very different or just slightly? I am wondering if there is a step where a seed isn't being set... In any case, if the results are slightly different, it is not because of the CNV mode and might just be a result of running scATOMIC twice, although I think that shouldn't be happening, so perhaps there is a bug I need to look for.

Thanks for your interest in using scATOMIC!

sabrina0701 commented 9 months ago

Hi Ido,

Thanks for your reply. Without and with CNV mode, the numbers of tumor cells are 10,998 and 14,642, respectively, and the blood cell counts also differ slightly. You're right that I haven't set a seed, but I also couldn't see on GitHub which step needs one.

Best wishes, Sabrina

inofechm commented 9 months ago

To clarify, I didn't mean you need to set a seed; I meant there may be a step in the backend code where a seed is not being set, so I would need to find that. As for the ~10,000 versus ~14,000 cells, that difference doesn't seem negligible, so I'll try to figure out what may be going on. Could you clarify whether the two runs have similar layer_6 values? I want to see if the difference arises in the first classification steps or in the create_summary_matrix function.
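One way to answer the layer_6 question is to join the two result tables by cell and check where they diverge. The sketch below uses two small made-up tables in place of the real runs. The `layer_6` and `scATOMIC_pred` columns are named in this thread; the `cell_names` key column and all values are assumptions for the example.

```r
# Toy results from a run without CNV mode.
run_plain <- data.frame(
  cell_names    = paste0("cell", 1:6),
  layer_6       = c("Neuroblastoma", "Neuroblastoma", "T cell",
                    "Normal Tissue Cell", "B cell", "Neuroblastoma"),
  scATOMIC_pred = c("Neuroblastoma", "Neuroblastoma", "T cell",
                    "Normal Tissue Cell", "B cell", "Neuroblastoma"),
  stringsAsFactors = FALSE
)

# Toy results from a run with CNV mode, deliberately in a different row order.
run_cnv <- data.frame(
  cell_names    = paste0("cell", 6:1),
  layer_6       = c("Neuroblastoma", "Plasmablast", "Normal Tissue Cell",
                    "T cell", "Neuroblastoma", "Neuroblastoma"),
  scATOMIC_pred = c("Neuroblastoma", "Plasmablast", "Neuroblastoma",
                    "T cell", "Neuroblastoma", "Neuroblastoma"),
  stringsAsFactors = FALSE
)

# Join by cell so that row order does not matter.
merged <- merge(run_plain, run_cnv, by = "cell_names",
                suffixes = c(".plain", ".cnv"))

layer6_same <- merged$layer_6.plain == merged$layer_6.cnv
final_same  <- merged$scATOMIC_pred.plain == merged$scATOMIC_pred.cnv

# Same layer_6 but different final call: the difference arises after the
# classification steps. Different layer_6: it arises during classification.
print(table(layer6_same, final_same))
print(merged$cell_names[layer6_same & !final_same])
```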

sabrina0701 commented 9 months ago

Riemondy.csv https://drive.google.com/file/d/1SdO2LhCeGbuFWii6A0FqdCaeRxFb8b75/view?usp=drive_web

Riemondy_CNV.csv https://drive.google.com/file/d/1btc2OnZhFSbcWbPhVGJOCraRPGFl8JOm/view?usp=drive_web

Hi team, I have tried to share the results; could you check whether you can open them?

Best wishes, Sabrina

inofechm commented 9 months ago

Yeah, so it looks like we are getting more normal tissue cells in one run. This shouldn't have to do with the CNV mode; rather, it is just a result of repeating the analysis twice. There is a bit of randomness involved and I need to pinpoint exactly where it occurs. In the meantime, please run each sample separately. I see from the results that the column 'orig.ident' indicates different sample IDs... You need to split them so that you have a count matrix for each of these IDs before running scATOMIC, as mixing samples can lead to issues with deciding whether cells are cancer or not.

In any case, if I were you I would cluster the cells, and if you see mixed clusters of mostly aneuploid cells and scATOMIC_pred = 'medulloblastoma', just assume most of these cells are cancer cells.

sabrina0701 commented 9 months ago

Thanks for your reply. Could I kindly check whether I have to split the data into single samples? This is a pan-cancer study that I have already been running for weeks; if splitting is required, I think I will have to rerun the whole atlas. Many thanks in advance!

Best wishes Sabrina

inofechm commented 9 months ago

Hi Sabrina, scATOMIC, copyKAT (the backend of CNV mode), and any other CNV method (e.g., inferCNV) all assume that there is only one cancer sample at a time. If you want an accurate prediction of malignant vs. non-malignant in cell types that are not immune or stromal cells and unique normal tissue cells, then it is indeed important to run it per sample. If you are working on a high-performance cluster you should be able to run scATOMIC annotation jobs for each sample, and it really shouldn't take a long time, definitely not weeks... On our servers I am able to annotate thousands of samples representing millions of cells in less than a few hours and then just merge the results.
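The "annotate each sample, then merge" pattern can be sketched as below. Note that `annotate_sample()` is a hypothetical stand-in that only computes per-cell totals; in a real run it would wrap the scATOMIC workflow for one sample. The toy matrices and sample names are invented as well.

```r
# Hypothetical stand-in for the per-sample annotation step.
annotate_sample <- function(counts) {
  data.frame(cell_names = colnames(counts),
             n_counts = colSums(counts),
             stringsAsFactors = FALSE)
}

# Toy per-sample count matrices (genes x cells).
per_sample_counts <- list(
  S1 = matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), c("a", "b", "c"))),
  S2 = matrix(1:4, nrow = 2, dimnames = list(c("g1", "g2"), c("d", "e")))
)

# Annotate each sample independently and record which sample each row came from.
results <- lapply(names(per_sample_counts), function(s) {
  out <- annotate_sample(per_sample_counts[[s]])
  out$sample_id <- s
  out
})

# Stack the per-sample tables into one result table.
merged <- do.call(rbind, results)
rownames(merged) <- NULL
print(merged)
```

On a cluster, each element of `per_sample_counts` would typically be a separate job, with only the final `rbind` step run on the collected outputs.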

sabrina0701 commented 9 months ago

Hi Ido,

Many thanks, this really helps. I will run scATOMIC for each sample. One last question: do you have any experience analysing a single sample with more than 45,000 cells? scATOMIC failed for these samples when I ran it.

Best regards, Sabrina

inofechm commented 9 months ago

45,000 cells from one sample seems like a large number; I believe the largest samples I have tried are ~25,000 cells. I don't think it should be a problem if the hardware you are using has enough memory. What is your maximum memory allowance? One thing that may speed things up is setting confidence_cutoff = F in run_scATOMIC and create_summary_matrix; this bypasses the confidence cutoff step and scores each cell to a terminal class. The code can sometimes get slow at this step. Can you provide an error log showing where it failed on your end?

sabrina0701 commented 9 months ago

Hi Ido,

Sorry for the late reply,

It took some time to request a large amount of memory on the UCL clusters. I asked for 1T of memory, which I thought should be enough for 45,000 cells, since 200G of memory works for 25,000 cells. However, it still failed with "memory not mapped". Please find the detailed log files attached.

BW Sabrina

inofechm commented 9 months ago

Hi Sabrina, You definitely don't need 1T for 45,000 cells; 200G should be more than enough. Do you know at what step this is happening? It could also be an issue with multiple cores, where one core doesn't have enough memory, so trying with mc.cores=1 may work.

sabrina0701 commented 9 months ago

It happened in create_summary_matrix.

Sure, I will try with mc.cores=1.

inofechm commented 9 months ago

Do you know if it is happening with the copyKAT CNV step (the part where it will print 'step1: read and filter data' ...) or with the scATOMIC step (after those steps, should have some outputs from MAGIC printed out when that is running)?

sabrina0701 commented 9 months ago

Hi, it happened after the copyKAT step, at the point where the MAGIC output is printed. It works now that I use 1 core. BW, Sabrina

inofechm commented 9 months ago

OK, so I would recommend using one core, or at least fewer than 10, when you have lots of cells. Good luck with your analyses!
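For context on why too many cores can end in a confusing downstream error: when a `parallel::mclapply()` worker dies (for example from running out of memory), its jobs come back as `NULL`, and `unlist()` then silently drops them, which is consistent with warnings 4 and 5 in the log at the top of this thread. The sketch below shows that effect and a defensive wrapper. `checked_mclapply()` is a hypothetical helper written for this illustration, not part of copyKAT or scATOMIC.

```r
library(parallel)

# What a partially failed mclapply() result looks like: the job from a dead
# worker is NULL.
partial <- list(c(1, 2, 3), NULL, c(7, 8, 9))
print(length(unlist(partial)))  # 6, not 9: unlist() drops the NULL entry

# Hypothetical wrapper: rerun serially any job that returned NULL or an error.
# Caveat: this treats a legitimate NULL return value as a failure.
checked_mclapply <- function(X, FUN, mc.cores = 1L) {
  res <- mclapply(X, FUN, mc.cores = mc.cores)
  failed <- vapply(res,
                   function(r) is.null(r) || inherits(r, "try-error"),
                   logical(1))
  if (any(failed)) {
    warning(sum(failed), " job(s) failed in parallel; rerunning them serially")
    res[failed] <- lapply(X[failed], FUN)
  }
  res
}

# With every job delivered, the results fill the matrix column by column.
smoothed <- checked_mclapply(1:4, function(i) rep(i, 3), mc.cores = 1L)
m <- matrix(unlist(smoothed), ncol = 4, byrow = FALSE)
print(dim(m))
```

Because the parallel call in question sits inside the copyKAT backend, the practical fix for users remains the one given in this thread: lower `mc.cores`.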

sabrina0701 commented 9 months ago

I really appreciate your patience, many thanks!
