Help with merge_samples_to_pseudobulk rule and reproducing published results

Hello,

First of all, thank you for the very nice tool you have created; it has been very useful in my research!

I would like to ask for your assistance with a couple of issues I've encountered:

Rule merge_samples_to_pseudobulk

I'm noticing that the input for this rule is a file with the sample name repeated multiple times (corresponding to the number of batches, in my case 297). This seems incorrect, as I believe there should only be one input file when there is only one sample (name_sample.bam) and the output should be name_sample.merged.bam, which would be the same as the input in cases where there is only one sample. This current behavior not only leads to potential discrepancies in results but also significantly slows down the pipeline. To circumvent this, I've been skipping this rule, but I realize this won't be feasible when processing multiple samples together. Do you also have this problem when running the pipeline?
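To make the expected behavior concrete, here is a minimal sketch of how the inputs for the merge step could be resolved. It is not the pipeline's actual code: the function name `pseudobulk_inputs`, the `(sample, batch)` pair layout, and the `{sample}.bam` naming are assumptions made for this example. The idea is that each sample contributes exactly one BAM file, no matter how many batches it is split into.

```python
def pseudobulk_inputs(sample_batches):
    """Return one BAM path per unique sample, preserving first-seen order.

    sample_batches: iterable of (sample, batch) pairs (hypothetical layout).
    """
    # dict.fromkeys keeps insertion order and drops repeated sample names.
    unique_samples = dict.fromkeys(sample for sample, _batch in sample_batches)
    return [f"{sample}.bam" for sample in unique_samples]


# One sample split into 297 batches should still yield a single input file.
rows = [("name_sample", batch) for batch in range(297)]
inputs = pseudobulk_inputs(rows)
print(inputs)
```

With a single sample the list has one entry, so the merged output would be equivalent to the input, which is the behavior described above.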

Reproducing Results from N87 Cancer Cell Line

I have run the pipeline on data from the N87 cancer cell line, as done in your published paper. However, I am not obtaining the same variants as those listed in your supplementary table, despite using the default parameters for mutation calling. I've attached the final genotypes.txt file I obtained for this N87 sample. Could you please verify whether this matches your results? If not, do you have any suggestions as to what might cause the discrepancy? Additionally, your paper mentions validation through clonal analysis. Could you share the clonal annotations for this dataset? Ideally, a dictionary with each of the four clones as keys and a list of the corresponding cell barcodes as values would be extremely helpful.
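As an illustration of the requested annotation format, the sketch below builds such a dictionary from per-cell assignments. Everything in it is hypothetical: the function name `clones_to_barcodes`, the clone labels, and the barcodes are placeholders, not values from the N87 dataset.

```python
from collections import defaultdict


def clones_to_barcodes(assignments):
    """Group cell barcodes by clone.

    assignments: iterable of (barcode, clone) pairs (hypothetical layout).
    Returns a plain dict mapping each clone to its list of barcodes.
    """
    clones = defaultdict(list)
    for barcode, clone in assignments:
        clones[clone].append(barcode)
    return dict(clones)


# Placeholder barcodes and clone labels, for illustration only.
example = [
    ("BARCODE_A-1", "clone_1"),
    ("BARCODE_B-1", "clone_2"),
    ("BARCODE_C-1", "clone_1"),
]
annotations = clones_to_barcodes(example)
print(annotations)
```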

If you need any further clarification to address these issues, please let me know. Thank you in advance for your time and help,

Sara Costa

Attachment: genotypes.txt

Hi Sara,

Thanks for your email. At first look, I think you've identified some issues. Let me take a closer look at this, and I'll get back to you.

Jeff

Hi Jeff,

Thank you so much for your answer and for taking the time to help me. I would like to update you on the second issue I mentioned, the validation of the N87 mutations.

I realized that my analysis used a different reference genome from the one you used (GRCh38 instead of GRCh37), which is why I couldn't find the same mutations as you did. After converting the positions from one build to the other, I was able to find 3 of the 4 positions listed in the supplementary table for the N87 line when calling the mutations in pseudobulk. However, after filtering and assessing the mutations at the single-cell level, only 1 of those 3 mutations remains in the final genotypes.txt file. I used the default filtering parameters from the Snakemake file you provide, so that is a bit odd.
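One way to pin down where the variants drop out is to compare the expected positions against the calls at each stage. The sketch below is only an illustration under stated assumptions: the function name `trace_variants` and the `(chromosome, position)` tuples are hypothetical, the coordinates are placeholders rather than the real N87 positions, and all three inputs are assumed to be in the same genome build already.

```python
def trace_variants(expected, pseudobulk_calls, final_genotypes):
    """Report, for each expected variant, the last stage where it was seen.

    All arguments are iterables of (chromosome, position) tuples that must
    already share one genome build (hypothetical layout).
    """
    pseudobulk_calls = set(pseudobulk_calls)
    final_genotypes = set(final_genotypes)
    report = {}
    for variant in expected:
        if variant in final_genotypes:
            report[variant] = "final"
        elif variant in pseudobulk_calls:
            report[variant] = "lost_in_filtering"
        else:
            report[variant] = "not_called"
    return report


# Placeholder coordinates mirroring the situation described above:
# 4 expected, 3 called in pseudobulk, 1 left after filtering.
expected = [("chr1", 100), ("chr2", 200), ("chr3", 300), ("chr4", 400)]
pseudobulk = [("chr1", 100), ("chr2", 200), ("chr3", 300)]
final = [("chr1", 100)]
report = trace_variants(expected, pseudobulk, final)
for variant, stage in report.items():
    print(variant, stage)
```

Variants marked `lost_in_filtering` are the ones worth checking against the individual filtering thresholds.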

Let me know what you think and if I can provide you with any other information. Also, if it would be easier to have a short video call and discuss these issues, I am available for that.

Sara

Thanks for the information.

I have looked into the first issue, with merging to pseudobulk. There was a bug where samples with multiple batches were being replicated. I have fixed this, and the version on GitHub should now merge correctly. Please update, and I apologize for the problems.

I will take a look at the N87 issue. The scientist who did the original analysis has left the lab, so bear with me as I try to recreate this analysis.

Thanks, Jeff

Hi Jeff,

Thank you for the update and for fixing the bug! I appreciate your efforts and will wait for the remaining updates as you work to recreate the analysis. I understand it can take some time.

Please let me know if you need any additional information in the meantime.

Thanks, Sara

U54Bioinformatics / PhylinSic_Project

Help with merge_samples_to_pseudobulk rule and reproducing published results #3