Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

Hi,

Thank you very much for putting together this code.

I would like to better understand when Bulk2Space might help versus when there are limits to applicability to Bulk2Space, following a journal club presentation where I learned more about the paper and method.

I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

Example 1 (Exact Code for Concrete Test):

In the spirit of a GitHub “issue,” I tried to start with concrete examples for discussion based upon issue #8 .

I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example that otherwise still used only files derived from the same samples used for the PDAC example. I also have some more details notes in the uploaded PDF.

Nevertheless, if that might possibly help the discussion, I have provided those.

If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

Example 2 (Theoretical Question):

Is it possible to run bulk2spatial as described below?

1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

2) Export model from Patient A.

3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

Additionally, if I understand correctly, then I think an image for the tissue for Patient B can not be provided. If so, I think the shape of the issue section for Patient B can’t be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable to interpret Patient B. However, if I might be misunderstanding anything, then please let me know.

Example 3 (Summary Questions):

Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like you a projection of estimations on the histology image for slide 2 was not (or could not?) be provided.

Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

Similarly, if your training sample had 90% tumor, then I would expect limitations is looking at the projection from a spatial transcriptomics project where the tissues had a very different percent tumor such as closer to 20% tumor. I would also expect there often could be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

Thank you very much for your help!

Sincerely, Charles

Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

PDAC_Test.pdf

ZJUFanLab / bulk2space

Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data #15