CRI-iAtlas / ImmuneSubtypeClassifier

An R package for classification of immune subtypes in cancer using gene expression data.

issue with last version of xgboost #6

Open dvenet opened 4 years ago

dvenet commented 4 years ago

Hi,

The package seems to be incompatible with the latest version of xgboost, probably because of the issue outlined here: https://github.com/dmlc/xgboost/issues/5794

Using version 1.0.0.2 of xgboost works OK (though it warns that the models were built with xgboost < 1.0.0; it might be good to upgrade them anyway).

The error message I got was:

```
Error in predict.xgb.Booster(mi$bst, Xbin) : [12:28:19] amalgamation/../src/learner.cc:506: Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied. Are you using raw Booster interface?
```

Gibbsdavidl commented 4 years ago

Thanks for letting me know... I'll check it out. Looks like it needs to be updated. -dave

Gibbsdavidl commented 3 years ago

Right now, until I can get the models re-saved as part of the new version... this works:

```r
require(devtools)
install_version("xgboost", version = "1.0.0.1")
```
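(If xgboost was already loaded, you will likely need to restart the R session before reinstalling; you can confirm the version afterwards with `packageVersion("xgboost")`.)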

David-Caceres commented 11 months ago

Hi,

I'm getting a similar warning:

```
[10:58:49] WARNING: amalgamation/../src/learner.cc:556: Loading model from XGBoost < 1.0.0, consider saving it again for improved compatibility
```

I reverted from the latest version to the recommended one:

```r
packageVersion("xgboost")
# [1] '1.0.0.1'
```

Now I'm running this version and getting the same warning.

Is the classification reliable with this warning?

Thanks

David

Gibbsdavidl commented 11 months ago

Hi there!

I was thinking about this and I think it should be OK. Interesting though... if you're on version 1.0.0.1, where is that warning coming from?

You might try classifying some known TCGA samples to convince yourself that it's working. The subtype labels can be found here in the package: inst/data/five_signature_mclust_ensemble_results.tsv.gz
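Something like this should do it (a minimal sketch, assuming a genes-by-samples matrix `X` of TCGA expression with gene symbols as rownames; the column names in the merge are guesses, so check both headers):

```r
# Minimal sanity-check sketch: compare the classifier's calls on known TCGA
# samples against the published subtype labels shipped with the package.
library(ImmuneSubtypeClassifier)

# Per-sample subtype calls (README-style interface).
calls <- callEnsemble(X, geneids = "symbol")

# Published subtype labels shipped with the package.
labfile <- system.file("data", "five_signature_mclust_ensemble_results.tsv.gz",
                       package = "ImmuneSubtypeClassifier")
labels <- read.delim(gzfile(labfile))

# Column names below ("SampleIDs", "BestCall", "AliquotBarcode",
# "ClusterModel1") are assumptions; verify against the actual headers.
m <- merge(calls, labels, by.x = "SampleIDs", by.y = "AliquotBarcode")
table(predicted = m$BestCall, published = m$ClusterModel1)
```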

I'm pretty far along with releasing a new version, it's built on a more general platform: https://github.com/Gibbsdavidl/robencla/

Thanks for your comment, good motivation to fix this! :-)

David-Caceres commented 11 months ago


Thanks for your answer. I tried to classify the TCGA samples and the results were correct even with this warning, so the job looks to be done.

We are running 3'UTR samples and getting some odd results; we think there could be a gene-length bias between our samples and TCGA. We want to compare our gene expression in the five gene set signatures to check for any issue. It would be good to know which genes belong to each signature and also how expression is compared between samples. I have checked this file, https://github.com/CRI-iAtlas/ImmuneSubtypeClassifier/blob/master/inst/important_features_in_the_ensemble_model.tsv, but I'm not sure which genes we should compare to check for differences between our gene expression and TCGA in the five signatures you used in your model.

Thanks in advance.

David

Gibbsdavidl commented 11 months ago

Huh, interesting. Not sure that’s been tried before.

The gene sets can be found here, under "Gene Expression Signatures":

The Immune Landscape of Cancer | NCI Genomic Data Commons: https://gdc.cancer.gov/about-data/publications/panimmune

It has both the extensive set of 160 signatures as well as the five core sets used for immune classification.

Hope that helps!


David-Caceres commented 10 months ago

Thank you very much David,

But I can't find which specific genes belong to each signature; maybe I'm missing something. My goal is to check the quantitative relationship between top gene pairs in each signature to compare my data vs. TCGA data.

We found a lack of quality in many of our samples, and we need to check whether certain predictions are reliable; with many under-counted genes in our samples it is not easy to work.

I appreciate your help.

David

Gibbsdavidl commented 10 months ago

Hi David, on the page I linked to above, I downloaded the file PanImmune_GeneSet_Definitions. I will put it here; see the tab named "Genes".

-dave

PanImmune_GeneSet_Definitions.xlsx
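Something like this should pull the per-signature gene lists out of that spreadsheet (the sheet name "Genes" comes from the comment above; the column names are guesses, so check the actual header):

```r
# Sketch: extract per-signature gene lists from the attached spreadsheet.
# Sheet name "Genes" is from the comment; column names are assumptions.
library(readxl)

genes <- read_excel("PanImmune_GeneSet_Definitions.xlsx", sheet = "Genes")
sig_lists <- split(genes$Gene, genes$SetName)   # assumed columns: Gene, SetName
lengths(sig_lists)                              # number of genes per signature
```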

David-Caceres commented 10 months ago

Thanks man,

I was really missing something.

And I hope I'm not driving you crazy with so many questions, but I have another one.

I got the scores for each gene set using your guide: https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/blob/master/Notebooks/How_to_produce_gene_set_scores.ipynb

I matched the results for the test data "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv". Then I tried to get the scores for TCGA data as you did in "five_signature_mclust_ensemble_results.tsv.gz" and got different results. I tried upper-quartile normalization and raw counts; the scores were closer to your results with raw counts, but they did not match.

Am I missing something again?

Cheers

David

Gibbsdavidl commented 10 months ago

Yeah, no problem! Personally, I would avoid getting too into the scoring method used... it was unusual and strongly affected by which samples were present during normalization (the median-scaling part). I think ssGSEA or singscore, both single-sample, rank-based gene set scoring methods, are better. That said...

When you say you matched the results... does that mean with the EB++ expression data, you reproduced the 160 signature scores?

Then you were getting scores for TCGA... what does that mean exactly? The EB++ data are TCGA.

It's been some years, so my recollection is not super clear, but the method depended on having all the TCGA samples (~9K samples) and doing gene-based median scaling with that set of samples. Any deviation and the scores won't match exactly. That's why it's not really a general method for scoring; it was just used to produce scores that went into clustering (with Mclust) so all samples could be clustered together.
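To make that concrete, a sketch of the cohort dependence (assuming `X` is a genes-by-samples matrix and `sig_genes` is a character vector of one signature's genes): the gene-wise median is computed over whatever samples are present, so any subsetting shifts the scores, whereas a single-sample method scores each sample independently.

```r
# Gene-wise median centering: medians depend on which samples are in X,
# so scores shift whenever the cohort changes (the ~9K-sample dependence).
Xcentered <- sweep(X, 1, apply(X, 1, median, na.rm = TRUE), "-")

# Cohort-independent alternative (singscore), per the suggestion above;
# `sig_genes` is an assumed character vector of one signature's genes.
library(singscore)
ranked <- rankGenes(X)                           # per-sample gene ranks
scores <- simpleScore(ranked, upSet = sig_genes) # one score per sample
```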

David-Caceres commented 10 months ago

> When you say you matched the results... does that mean with the EB++ expression data, you reproduced the 160 signature scores?

Yes

> Then you were getting scores for TCGA... what does that mean exactly? The EB++ data are TCGA.

I used the colorectal (COAD+READ) cohort and obtained scores that did not match "five_signature_mclust_ensemble_results.tsv.gz" for the same samples. The final clustering is identical.

As I said, we are clustering our own samples (3'UTR from paraffin). These are low-quality samples and we are worried about the accuracy of your model on them. I have found that the worst-quality samples (under 90% of signature gene content) tend to classify massively into group 4 (lymphocyte depleted); in addition, we have a probable overrepresentation of samples classified into group 3 (inflammatory), and I'm worried about a possible error in the score calculation.

I also checked the expression of the top 10 gene pairs by gain in each group after filtering out lower-quality samples (https://github.com/CRI-iAtlas/ImmuneSubtypeClassifier/blob/master/inst/important_features_in_the_ensemble_model.tsv), comparing my data with the mentioned colorectal TCGA cohort. I obtained the following equivalence results: group 1 good, group 2 OK, group 3 bad, group 4 good (not enough samples in groups 5 and 6 to do it).

Our goal is to check the reliability of the clustering on our samples. We know that lower-quality samples are misclassified into group 4 (which can be corrected after quality selection), but the question is: is there any other bias that causes the overrepresentation of samples classified into group 3?

Best regards

David

Gibbsdavidl commented 10 months ago

Hi David, for the scores that didn't match, were those from the output of the classifier? Those could vary because I changed the implementation a while back, but as long as the clustering is identical, that's the important thing.

Hmmm, I haven't had much experience with low-quality samples... but I had a thought. What if you did an experiment where you took the TCGA COAD or READ samples and applied some random dropouts or degradation to make them look like your paraffin data? Then you could see what happens!
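Roughly like this (a sketch; `X` is an assumed genes-by-samples COAD/READ matrix with gene symbols as rownames, the dropout rate is arbitrary, and the "BestCall" column name is a guess):

```r
# Rough sketch of the dropout experiment: zero out a random fraction of
# values to mimic degraded 3'UTR/paraffin data, then compare subtype calls.
library(ImmuneSubtypeClassifier)

set.seed(1)
dropout <- 0.2                                   # arbitrary dropout rate
Xdeg <- X
Xdeg[matrix(runif(length(X)) < dropout, nrow = nrow(X))] <- 0

calls_orig <- callEnsemble(X,    geneids = "symbol")
calls_deg  <- callEnsemble(Xdeg, geneids = "symbol")

# How stable are the calls under degradation?
table(original = calls_orig$BestCall, degraded = calls_deg$BestCall)
```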

I guess I'm not sure what would push samples into group 3.

-dave
