asaki1986 commented 1 month ago

Hi,

I am using the calculated data from STAR to infer the cancer type using TULIP.

There are many genes without expression value, and I encountered some mistake after running the command, which is listed as follows.

Number of missing genes: 17029 Adding missing genes with expression values set at 0. Using model trained on protein coding genes only and 32 tumor types. Model: "sequential"

Layer (type) Output Shape Param #

conv1d (Conv1D) (None, 19739, 128) 2688

activation (Activation) (None, 19739, 128) 0

max_pooling1d (MaxPooling1D) (None, 19739, 128) 0

conv1d_1 (Conv1D) (None, 19720, 128) 327808

activation_1 (Activation) (None, 19720, 128) 0

max_pooling1d_1 (MaxPooling1 (None, 1972, 128) 0

flatten (Flatten) (None, 252416) 0

dense (Dense) (None, 200) 50483400

activation_2 (Activation) (None, 200) 0

dropout (Dropout) (None, 200) 0

dense_1 (Dense) (None, 20) 4020

activation_3 (Activation) (None, 20) 0

dropout_1 (Dropout) (None, 20) 0

dense_2 (Dense) (None, 32) 672

activation_4 (Activation) (None, 32) 0

Total params: 50,818,588 Trainable params: 50,818,588 Non-trainable params: 0

Loading weights from: models/cnn_32_pc_weights.h5 Getting predictions. WARNING:tensorflow:Model was constructed with shape (None, 19758, 1) for input KerasTensor(type_spec=TensorSpec(shape=(None, 19758, 1), dtype=tf.float32, name='conv1d_input'), name='conv1d_input', description="created by layer 'conv1d_input'"), but it was called on an input with incompatible shape (None, 2734, 1). Traceback (most recent call last): File "tulip.py", line 344, in <module> main() File "tulip.py", line 310, in main predictions = selected_model.predict(input_data) File "/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1751, in predict tmp_batch_outputs = self.predict_function(iterator) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__ result = self._call(*args, **kwds) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call self._initialize(args, kwds, add_initializers_to=initializers) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 760, in _initialize *args, **kwds)) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3066, in _get_concrete_function_internal_garbage_collected graph_function, _ = self._maybe_define_function(args, kwargs) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3463, in _maybe_define_function graph_function = self._create_graph_function(args, kwargs) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3308, in _create_graph_function capture_by_value=self._capture_by_value), File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 1007, in func_graph_from_py_func func_outputs = python_func(*func_args, **func_kwargs) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 668, in wrapped_fn out = weak_wrapped_fn().__wrapped__(*args, **kwds) File "/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 994, in wrapper raise e.ag_error_metadata.to_exception(e)

ValueError: in user code:

/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/training.py:1586 predict_function  *
    return step_function(self, iterator)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/training.py:1576 step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:1286 run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:2849 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/home/jjiang/.local/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:3632 _call_for_each_replica
    return fn(*args, **kwargs)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/training.py:1569 run_step  **
    outputs = model.predict_step(data)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/training.py:1537 predict_step
    return self(x, training=False)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/base_layer.py:1037 __call__
    outputs = call_fn(inputs, *args, **kwargs)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/sequential.py:369 call
    return super(Sequential, self).call(inputs, training=training, mask=mask)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/functional.py:415 call
    inputs, training=training, mask=mask)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/functional.py:550 _run_internal_graph
    outputs = node.layer(*args, **kwargs)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/base_layer.py:1020 __call__
    input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
/home/jjiang/.local/lib/python3.6/site-packages/keras/engine/input_spec.py:254 assert_input_compatibility
    ' but received input with shape ' + display_shape(x.shape))

ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 252416 but received input with shape (None, 34432)

Any suggestions?

asaki1986 commented 1 month ago

I rerun the command using FPKM_UQ values calculated from FPKM-UQ.py within rseqc. Another error came out as,

Number of missing genes: 19758 Adding missing genes with expression values set at 0. Traceback (most recent call last): File "tulip.py", line 344, in <module> main() File "tulip.py", line 269, in main input_data = prep_data(features) File "tulip.py", line 128, in prep_data tpm_df = df.div(df.sum(axis=1), axis=0) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 655, in f new_data = self._combine_frame(other, na_op, fill_value) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5870, in _combine_frame new_data = ops.dispatch_to_series(self, other, _arith_op) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 275, in dispatch_to_series bm = left._mgr.operate_blockwise(right._mgr, array_op) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 367, in operate_blockwise return operate_blockwise(self, other, array_op) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/internals/ops.py", line 38, in operate_blockwise res_values = array_op(lvals, rvals) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 190, in arithmetic_op res_values = na_arithmetic_op(lvalues, rvalues, op) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 143, in na_arithmetic_op result = expressions.evaluate(op, left, right) File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 233, in evaluate return _evaluate(op, op_str, a, b) # type: ignore File "/home/jjiang/.local/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 68, in _evaluate_standard return op(a, b) ZeroDivisionError: float division by zero

And it seems that the FPKM_UQ value was much smaller than those provided in example_data.

Any recommended script that I can use to convert the bam/sam file from mapping to FPKM_UQ?

satishbioinfo commented 1 month ago

Hello,

Can you share a snapshot of your input matrix? Also you can try using htseq-count for FPKM_UQ value generation (based on GDC portal recommendation). From the error I am thinking that your input matrix is having issues.

Regards, Satish

asaki1986 commented 1 month ago

Hi Dr. Satish,

I first create the conda env as guided. The example data did PASS the test. But errors came out either with our data or data fetched from GDC.

Here is the first 40 rows of my input file. Since the calculated FPKM_UQ was integer, I multiply the cols with 1.001 to force float. And the output file did not contain gene symbols, I used gene_id in the second col instead.

ENSG00000223972.4,ENSG00000223972.4,14.014 ENSG00000227232.4,ENSG00000227232.4,432.432 ENSG00000243485.2,ENSG00000243485.2,1.001 ENSG00000237613.2,ENSG00000237613.2,0 ENSG00000268020.2,ENSG00000268020.2,0 ENSG00000240361.1,ENSG00000240361.1,0 ENSG00000186092.4,ENSG00000186092.4,0 ENSG00000238009.2,ENSG00000238009.2,2.002 ENSG00000239945.1,ENSG00000239945.1,0 ENSG00000233750.3,ENSG00000233750.3,7.007 ENSG00000237683.5,ENSG00000237683.5,115.115 ENSG00000268903.1,ENSG00000268903.1,1.001 ENSG00000269981.1,ENSG00000269981.1,0 ENSG00000239906.1,ENSG00000239906.1,4.004 ENSG00000241860.2,ENSG00000241860.2,71.071 ENSG00000222623.1,ENSG00000222623.1,1.001 ENSG00000241599.1,ENSG00000241599.1,18.018 ENSG00000228463.4,ENSG00000228463.4,18.018 ENSG00000241670.2,ENSG00000241670.2,0 ENSG00000237094.7,ENSG00000237094.7,85.085 ENSG00000250575.1,ENSG00000250575.1,4.004 ENSG00000233653.3,ENSG00000233653.3,74.074 ENSG00000224813.2,ENSG00000224813.2,0 ENSG00000235249.1,ENSG00000235249.1,3.003 ENSG00000269732.1,ENSG00000269732.1,0 ENSG00000256186.1,ENSG00000256186.1,0 ENSG00000236601.1,ENSG00000236601.1,0 ENSG00000236743.1,ENSG00000236743.1,0 ENSG00000236679.2,ENSG00000236679.2,5.005 ENSG00000231709.1,ENSG00000231709.1,0 ENSG00000235146.2,ENSG00000235146.2,0 ENSG00000239664.2,ENSG00000239664.2,0 ENSG00000230021.3,ENSG00000230021.3,11.011 ENSG00000223659.1,ENSG00000223659.1,0 ENSG00000225972.1,ENSG00000225972.1,376.376 ENSG00000225630.1,ENSG00000225630.1,339.339 ENSG00000237973.1,ENSG00000237973.1,1605.6 ENSG00000229344.1,ENSG00000229344.1,97.097 ENSG00000240409.1,ENSG00000240409.1,0 ENSG00000248527.1,ENSG00000248527.1,1883.88

And I also download a tsv file from GDC (a4beb0e4-f899-48f4-baf1-830039f9b345.rna_seq.augmented_star_gene_counts), and extract the gene_id,gene_name,fpkm_uq_unstranded col from the file. gene_id,gene_name,fpkm_uq_unstranded ENSG00000000003.15,TSPAN6,20.2140 ENSG00000000005.6,TNMD,0.0173 ENSG00000000419.13,DPM1,26.7597 ENSG00000000457.14,SCYL3,4.7112 ENSG00000000460.17,C1orf112,2.7447 ENSG00000000938.13,FGR,0.0038 ENSG00000000971.16,CFH,0.2485 ENSG00000001036.14,FUCA2,21.7214 ENSG00000001084.13,GCLC,13.9093 ENSG00000001167.14,NFYA,15.7420 ENSG00000001460.18,STPG1,0.9333 ENSG00000001461.17,NIPAL3,5.2795 ENSG00000001497.18,LAS1L,2.7264 ENSG00000001561.7,ENPP4,9.4615 ENSG00000001617.12,SEMA3F,2.8891 ENSG00000001626.16,CFTR,0.6292 ENSG00000001629.10,ANKIB1,10.0131 ENSG00000001630.17,CYP51A1,6.2237 ENSG00000001631.16,KRIT1,0.8454

And it report similar errors as " ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 252416 but received input with shape (None, 14336) "

Best, Junfeng

asaki1986 commented 1 month ago

data.tar.gz I attached the csv input files. test.csv is the one I downloaded from GDC, while the other is from our own.

Best, junfeng

asaki1986 commented 1 month ago

Quick update:

I update the ENSG name to decrease the number of missing genes from 17029 to 279, however, similar error came out.

ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 252416 but received input with shape (None, 249472)

satishbioinfo commented 3 weeks ago

Hello asaki, The issue you are having is because your input file is not in the correct format.

Prepare your input data similar to what we have shown in the "example_data/protein_coding_genes_htseq_fpkm_uq.csv". Make sure the values in columns "gene_id,gene_name" are same corresponding the example file ( you can write a simple script to do that). This should solve your problem.

Please let me know if you have any other questions.

Regards, Satish

asaki1986 commented 3 weeks ago

Hello asaki, The issue you are having is because your input file is not in the correct format.

Prepare your input data similar to what we have shown in the "example_data/protein_coding_genes_htseq_fpkm_uq.csv". Make sure the values in columns "gene_id,gene_name" are same corresponding the example file ( you can write a simple script to do that). This should solve your problem.

Please let me know if you have any other questions.

Regards, Satish

Thanks,

I tried as you suggest and the prediction work well.

And it seems that the weight of "Cervical squamous cell carcinoma and endocervical adenocarcinoma" cancer type is quite high, since two data of UNKOWN PRIMARY is predicted to this including one from GDC with case-id "0455295c-ea0f-408a-acca-a9422e054d66".

csvtk transpose ae9f4e53a59c/predictions_full_32_all.csv -D $'\t' | c samples ae9f4e53a59c Adrenocortical carcinoma 0.0009222297 Bladder Urothelial Carcinoma 0.055023987 Breast invasive carcinoma 0.011912946 Cervical squamous cell carcinoma and endocervical adenocarcinoma 0.42806906 Cholangiocarcinoma 0.0018230672 Colon adenocarcinoma 0.11883814 Lymphoid Neoplasm Diffuse Large B-cell Lymphoma 0.040870328 Esophageal carcinoma 0.046573117 Glioblastoma multiforme 0.0007533289 Head and Neck squamous cell carcinoma 0.06581423 Kidney Chromophobe 0.0032050526 Kidney renal clear cell carcinoma 0.0017174202 Kidney renal papillary cell carcinoma 0.003941357 Brain Lower Grade Glioma 0.00027571517 Liver hepatocellular carcinoma 0.00022769821 Lung adenocarcinoma 0.02721877 Lung squamous cell carcinoma 0.070290215 Mesothelioma 0.0017315596 Ovarian serous cystadenocarcinoma 0.00085877726 Pancreatic adenocarcinoma 0.004201148 Pheochromocytoma and Paraganglioma 0.00030281505 Prostate adenocarcinoma 0.004138819 Rectum adenocarcinoma 0.021881314 Sarcoma 0.0009646255 Skin Cutaneous Melanoma 0.0012491246 Stomach adenocarcinoma 0.042751815 Testicular Germ Cell Tumors 8.55032e-05 Thyroid carcinoma 0.00011139857 Thymoma 0.0194717 Uterine Corpus Endometrial Carcinoma 0.021043217 Uterine Carcinosarcoma 0.0033289029 Uveal Melanoma 0.0004026421 predicted_class Cervical squamous cell carcinoma and endocervical adenocarcinoma probability 0.42806906

samples GW23T369B04 Adrenocortical carcinoma 0.018461222 Bladder Urothelial Carcinoma 0.078229815 Breast invasive carcinoma 0.021386538 Cervical squamous cell carcinoma and endocervical adenocarcinoma 0.10849975 Cholangiocarcinoma 0.015696343 Colon adenocarcinoma 0.06722248 Lymphoid Neoplasm Diffuse Large B-cell Lymphoma 0.06625707 Esophageal carcinoma 0.011855229 Glioblastoma multiforme 0.0050791623 Head and Neck squamous cell carcinoma 0.056671094 Kidney Chromophobe 0.034526385 Kidney renal clear cell carcinoma 0.023273237 Kidney renal papillary cell carcinoma 0.02535736 Brain Lower Grade Glioma 0.008546297 Liver hepatocellular carcinoma 0.0054378435 Lung adenocarcinoma 0.095628574 Lung squamous cell carcinoma 0.092482924 Mesothelioma 0.013800432 Ovarian serous cystadenocarcinoma 0.0051462846 Pancreatic adenocarcinoma 0.045574322 Pheochromocytoma and Paraganglioma 0.008079847 Prostate adenocarcinoma 0.0074080774 Rectum adenocarcinoma 0.023419868 Sarcoma 0.03161704 Skin Cutaneous Melanoma 0.0132164 Stomach adenocarcinoma 0.02176549 Testicular Germ Cell Tumors 0.0019676166 Thyroid carcinoma 0.0010658016 Thymoma 0.045158256 Uterine Corpus Endometrial Carcinoma 0.017464066 Uterine Carcinosarcoma 0.017408738 Uveal Melanoma 0.012296385 predicted_class Cervical squamous cell carcinoma and endocervical adenocarcinoma probability 0.10849975

jonesse3 commented 2 weeks ago

@satishbioinfo Thank you for replying to @asaki1986!

@asaki1986 Is there anything we can do on our end to improve the documentation?

Our tool can only predict up to 32 primary tumors from TCGA. The sample you provided, case-id "0455295c-ea0f-408a-acca-a9422e054d66", is from the TARGET-AML project. It's possible that this sample is actually Acute Myeloid Leukemia, which is not one of the 32 primary tumors we trained the models on. It would be interesting to see what the predicted class will be if we retrained the model with AML samples.

asaki1986 commented 2 weeks ago

@satishbioinfo Thank you for replying to @asaki1986!

@asaki1986 Is there anything we can do on our end to improve the documentation?

Our tool can only predict up to 32 primary tumors from TCGA. The sample you provided, case-id "0455295c-ea0f-408a-acca-a9422e054d66", is from the TARGET-AML project. It's possible that this sample is actually Acute Myeloid Leukemia, which is not one of the 32 primary tumors we trained the models on. It would be interesting to see what the predicted class will be if we retrained the model with AML samples.

Hi, @jonesse3

Yes, it indeed required several steps to convert the data as example. Even though the RPKM_UP values were provided, we still need to update the data exactly the same as provided.

As cancer type prediction, we tried several data from RNASeq data, among which are types of lung, colon, and sarcomas. Samples are predicted quite well besides GW23T369B04, which were considered as high-grade sarcoma.

I've upload the data, it will be great if you can take a glance.

GW23T369B04.fpkm_uq.csv

Best

CBIIT / TULIP

runing error #5

Layer (type) Output Shape Param #

activation_4 (Activation) (None, 32) 0