ALLSorts runs successfully with the demo data, but reports errors when I run it with my own real data.

1) Input file (read counts for 20 genes)

,DSC3,IGF2BP1,KCNK3,IGJ,PTPRM,KCNN1,CD99,HAP1,AGAP1,CHST2,SLC2A5,IFITM1,ABHD3,PCLO,TNS1,NCF2,MIB1,TCFL5,ARHGEF4,RGMA
hezhiyan,0,1,0,0,0,0,8,0,0,1,0,0,0,2,0,5,0,6,7,0
LIANGXIAOLAN,0,0,0,9,0,0,0,1,0,6,0,1,2,6,0,0,0,1,10,0
LIAOJIFENG,0,0,9,0,12,0,0,0,13,207,0,2,0,1,0,2,0,11,108,0
lilingyi,0,0,0,0,14,1,0,0,16,85,0,0,1,0,0,4,4,26,2,0
LXINZHE,4,37,0,0,1,23,9,13,1,1,0,0,13,23,0,8,1,6,3,4
NGR200625001,1,0,9,10,1,0,125,5,1,9,35,13,2,1,13,21,12,18,0,0
qianli,1,0,0,0,1,1,0,0,1,10,0,5,0,4,0,4,0,3,2,1
qinpeixiang,0,0,0,88,1,1,109,1,8,177,5,203,7,0,408,83,20,26,8,10
luozhimin,0,0,0,1,0,2,504,1,0,0,48,168,1,8,29,44,20,78,0,0
luoyangying,0,0,0,8,4,23,301,4,11,30,28,310,10,7,55,200,3,27,10,4
huangzhegnhan,7,1,0,7,4,5,979,4,79,9,230,75,22,16,158,89,84,288,1,1
chenziyang,7,4,0,2,8,70,86,1,0,14,22,52,28,70,9,26,26,291,16,3
zhuxiangyan,0,0,0,0,3,0,486,0,0,2,114,97,5,1,10,17,7,94,1,2
zhaojinsheng,0,0,0,4,0,1,88,0,0,47,60,9,2,0,10,21,2,27,1,0
qinzuofa,0,0,0,0,1,0,219,0,0,25,26,145,2,0,3,177,4,50,0,0
huanglianbin,0,0,0,0,0,0,267,0,0,4,6,24,3,0,4,25,5,28,1,0
xuxunjia,0,0,0,0,0,2,5,0,0,0,2,4,2,0,0,9,0,0,0,0
xuyongyi,1,0,0,0,0,5,6,0,1,7,0,12,3,7,1,14,2,0,4,0
ZYZ,0,0,0,0,0,0,2,2,1,0,0,0,5,2,1,4,0,0,6,0

2) Command line

python ALLSorts -samples reverse.gene.forRF.classifier.HTseqCount.xls -destination test/

3) Error message

Prediction Mode

Loading classifier...
Saving predictions...
/home/hanxl/TEST/ExpressionProfile/LIMINGZE/ph-like_method/AllSorts/tools/MoRP/morp/morp.py:86: RuntimeWarning: divide by zero encountered in true_divide
  np.divide(counts_normalised,
/home/hanxl/TEST/ExpressionProfile/LIMINGZE/ph-like_method/AllSorts/tools/MoRP/morp/morp.py:86: RuntimeWarning: invalid value encountered in true_divide
  np.divide(counts_normalised,
/home/hanxl/TEST/ExpressionProfile/LIMINGZE/ph-like_method/AllSorts/tools/MoRP/morp/morp.py:109: RuntimeWarning: invalid value encountered in true_divide
  if (original/scaler) == normalised:
/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/numpy/lib/function_base.py:3942: RuntimeWarning: invalid value encountered in multiply
  x2 = take(ap, indices_above, axis=axis) * weights_above
/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1115: RuntimeWarning: All-NaN slice encountered
  r, k = function_base._ureduce(a, func=_nanmedian, axis=axis, out=out,
Traceback (most recent call last):
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "ALLSorts/__main__.py", line 20, in <module>
    allsorts.run()
  File "ALLSorts/allsorts.py", line 90, in run
    probabilities = allsorts_clf.predict_proba(ui.samples, parents=ui.parents)
  File "ALLSorts/pipeline.py", line 115, in predict_proba
    Xt = transform.transform(Xt)
  File "ALLSorts/stages/feature_creation.py", line 310, in transform
    self._iamp21Feature(counts),
  File "ALLSorts/stages/feature_creation.py", line 200, in _iamp21Feature
    temp = bin_counts.apply(median_filter, mode="constant", size=11, axis=1)
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
    return self.apply_empty_result()
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/pandas/core/apply.py", line 220, in apply_empty_result
    return self.obj._constructor_sliced(r, index=self.agg_axis)
  File "/home/hanxl/Software/anaconda3/envs/allsorts/lib/python3.8/site-packages/pandas/core/series.py", line 291, in __init__
    raise ValueError(
ValueError: Length of passed values is 0, index implies 19.

Thanks for giving ALLSorts a go! Let's see if we can get it working for you.

Please try converting your .xls file to a .csv (it should be fine to just export from excel) and giving it another go.

In addition, ALLSorts uses a large set of genes in order to make predictions. If your example above is what you are inputting (20 genes), the classifier will likely run into an error as it needs a minimum set to work. Refer to the wiki entry 0. Counts matrix format for more specific instructions regarding this.
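As a quick pre-flight check before running the classifier, a sketch like the following can report how many gene columns a counts file contains. It assumes a comma-delimited file laid out like the example in the report (samples as rows, genes as columns, sample names in the first column). The function name `summarise_counts` and the `min_genes` threshold of 1000 are hypothetical, chosen only for illustration; they are not part of ALLSorts, and the wiki remains the authority on the required gene set.

```python
import csv
import io


def summarise_counts(handle, min_genes=1000):
    """Return (n_samples, n_genes, ok) for a comma-delimited counts matrix.

    Assumes samples are rows and genes are columns, with sample names in the
    first column. min_genes is an illustrative threshold, not an ALLSorts value.
    """
    reader = csv.reader(handle)
    header = next(reader)
    # Skip the first (sample name) column and ignore empty header cells.
    genes = [name for name in header[1:] if name.strip()]
    # Count non-empty rows that carry a sample name.
    n_samples = sum(1 for row in reader if row and row[0].strip())
    return n_samples, len(genes), len(genes) >= min_genes


# Tiny in-memory example with two samples and three genes.
example = io.StringIO(",DSC3,IGF2BP1,KCNK3\nsampleA,0,1,0\nsampleB,4,37,0\n")
n_samples, n_genes, ok = summarise_counts(example)
print(f"{n_samples} samples, {n_genes} genes, enough genes: {ok}")
```

For a real file, the same function can be called on `open("counts.csv", newline="")`, where `counts.csv` stands in for your own exported file.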

If you need a hand, don't hesitate to ask! Otherwise, let me know how you go.

The description in the literature made me think that only 20 genes can be used for classification, but in fact there are more than 10,000 genes. I have now run it successfully on my own data. What is the difference between the Ph Group, Ph and Ph-like classifications in the result? It seems that the predictions for the KMT2A and TCF3-PBX1 categories are relatively accurate, but the accuracy for the others is not sufficient.


Edit: I have realised that there is a direct link to this repository from a recent paper that used the first version, hence the confusion. My fault, I should have realised. Thank you for highlighting this.

Good to see you got it working, let me address your questions:

The description in the literature made me think that only 20 genes can be used for classification.

Noticing that your input file is named forRF.classifier, I think that you may have mistaken this version of ALLSorts for the previous attempt, which was a Random Forest Classifier (found here). This version is distinct from that one and does not use Random Forest; it is a complete reimagining. Apologies if that has been a source of confusion! I will try to highlight this better in the welcome page.

in fact, there are more than 10,000 genes

In this version, a large set of genes is required for input as there are some custom features which are created from many. This is very different from how the original method worked, which I think only did use ~20 genes (I did not build that method personally).

What is the difference between Ph group, Ph and Ph-like classification in the result

Ph and Ph-like share a similar transcriptional signal, but Ph-like is defined as lacking the BCR:ABL1 fusion gene. Ph Group is a meta-subtype that encapsulates both Ph and Ph-like.

This version of ALLSorts uses a hierarchical classification approach. In the case of Ph/Ph-like/Ph Group, the classifier will first attempt to attribute a sample to Ph Group, then it will attempt to classify it as either Ph or Ph-like.
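The two-step decision described above can be sketched roughly as follows. This is an illustrative toy, not the ALLSorts implementation: the function `classify_ph`, the probability inputs and the 0.5 threshold are all assumptions made for the example, as is the choice to fall back to the parent label when neither child is confident.

```python
def classify_ph(p_ph_group, p_ph, p_ph_like, threshold=0.5):
    """Toy two-level decision: parent (Ph Group) first, then its children.

    All inputs are hypothetical probabilities; threshold is illustrative.
    """
    # Level 1: does the sample belong to the Ph Group meta-subtype at all?
    if p_ph_group < threshold:
        return "Not Ph Group"
    # Level 2: within Ph Group, try to separate Ph from Ph-like.
    if p_ph >= threshold and p_ph >= p_ph_like:
        return "Ph"
    if p_ph_like >= threshold:
        return "Ph-like"
    # Parent call is confident but neither child is: report the parent only
    # (an assumption of this sketch).
    return "Ph Group"


print(classify_ph(0.9, 0.8, 0.1))  # Ph
print(classify_ph(0.9, 0.2, 0.7))  # Ph-like
print(classify_ph(0.9, 0.4, 0.4))  # Ph Group
print(classify_ph(0.2, 0.9, 0.1))  # Not Ph Group
```

The point of the sketch is only the ordering: a child label (Ph or Ph-like) is considered only after the parent (Ph Group) has been accepted.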

The subtypes available within this classifier are listed via the reference below, with some differences (e.g. the inclusion of KMT2A/ZNF384/Ph/High signature groups and the lack of CRLF2(non-Ph-like)). I am still in the process of including this information within this Wiki.

Gu, Z., Churchman, M. L., Roberts, K. G., Moore, I., Zhou, X., Nakitandwe, J., … Mullighan, C. G. (2019). PAX5-driven subtypes of B-progenitor acute lymphoblastic leukemia. Nature Genetics, 51(2), 296–307.

It seems that the predictions of KMT2A and TCF3-PBX1 categories are relatively accurate, and the other accuracy is not enough

Keen to understand this more. Are the samples you're referring to being classified as "Unclassified" or are they being attributed to a different subtype than you expected?

If you cannot go into details here but you do want to discuss your results, feel free to e-mail me at breon.schmidt@petermac.org and we can look at the results privately.

I am very grateful for your patient answers. I now fully understand this version of ALLSorts.

best wishes

xuelian


No worries, closing the issue.

All the best.

Oshlack / ALLSorts

ValueError: Length of passed values is 0, index implies 19. #5
