atb-data / neoantigen-landscape-msi

Scripts for the preprint "The shared neoantigen landscape of MSI cancers reflects immunoediting during tumor evolution".
MIT License

Documentation on input formats for ReFrame #1

Open indapa opened 4 years ago

indapa commented 4 years ago

I was really excited to read your preprint using the ReFrame software you implemented. I was able to install the libraries and conda environments to run the example datasets described in the readme.

I want to run the software on my own data, but I am really confused about the formatting of the input data. Can you describe what each of the columns represents, or provide more guidance on how users can analyze their own data with ReFrame?

mjendrusch commented 4 years ago

I'm sorry for the very late reply.

I have now added a readme-file to ReFrame to better describe the input format:

https://github.com/atb-data/neoantigen-landscape-msi/tree/master/ReFrame

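Long story short - you need two types of files to apply ReFrame to your own data:

  • a reference peak-height file for normalisation, reflist.xlsx.
  • your actual input xlsx files in the "in" directory.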

Now, the actual data you need is the heights of the peaks from your fragment-size analysis for the main peak size (MP, usually the highest peak in a non-mutated reference-sample) and the secondary peaks at sizes MP - 4 (L4) to MP + 3 (R3). In each of your input files, you'll want to have a sheet named "Heights" (case sensitive) with the following format:

| Tumor ID | Run ID | L4 | L3 | L2 | L1 | MP | R1 | R2 | R3 |
|---|---|---|---|---|---|---|---|---|---|
| ignored | ignored | L4-size | L3-size | L2-size | L1-size | MP-size | R1-size | R2-size | R3-size |
| tumor-identifier | run-identifier | peak-height L4 | peak-height L3 | peak-height L2 | peak-height L1 | main-peak-height | peak-height R1 | peak-height R2 | peak-height R3 |
| tumor-identifier | run-identifier | ... | ... | ... | ... | ... | ... | ... | ... |

Add one "tumor-identifier run-identifier ..." row for each tumor sample you have. Each file in the "in" directory should contain the data for one marker of interest.
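
As a rough illustration of the "Heights" layout described above, here is a minimal Python sketch that assembles the rows for one marker. The function name `build_heights_rows`, the example main-peak size of 120, the sample values, and the literal "ignored" placeholders in the size row are all assumptions made for illustration; they are not part of ReFrame. The rows still have to be written to a sheet named "Heights" in an xlsx file with a spreadsheet tool or library of your choice.

```python
# Column labels from the "Heights" table, and each peak's offset from the main peak (MP).
PEAK_LABELS = ["L4", "L3", "L2", "L1", "MP", "R1", "R2", "R3"]
OFFSETS = [-4, -3, -2, -1, 0, 1, 2, 3]


def build_heights_rows(mp_size, samples):
    """Return the rows of a "Heights" sheet for one marker.

    mp_size: fragment size of the main peak (hypothetical example input).
    samples: iterable of (tumor_id, run_id, heights), where heights lists the
             eight peak heights in the order L4 ... R3.
    """
    header = ["Tumor ID", "Run ID"] + PEAK_LABELS
    # Second row: the first two cells are ignored, the rest hold the peak sizes.
    sizes = ["ignored", "ignored"] + [mp_size + offset for offset in OFFSETS]
    rows = [header, sizes]
    for tumor_id, run_id, heights in samples:
        if len(heights) != len(PEAK_LABELS):
            raise ValueError(f"expected {len(PEAK_LABELS)} peak heights for {tumor_id}")
        rows.append([tumor_id, run_id] + list(heights))
    return rows


# Made-up example data: two tumor samples for a single marker.
example_rows = build_heights_rows(
    120,
    [
        ("tumor-1", "run-1", [0, 5, 40, 300, 1200, 80, 10, 0]),
        ("tumor-2", "run-1", [0, 20, 400, 900, 700, 60, 5, 0]),
    ],
)
```

Keeping the row construction separate from the file writing makes it easy to check the layout before producing one xlsx file per marker.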

Now, for the reference file reflist.xlsx, you'll want the following format:

| Gene ID | Run ID | L4 | L3 | L2 | L1 | MP | R1 | R2 | R3 |
|---|---|---|---|---|---|---|---|---|---|
| gene-or-marker-id | ignored | median-relative-peak-height-L4 | median-relative-peak-height-L3 | median-relative-peak-height-L2 | median-relative-peak-height-L1 | median-relative-peak-height-MP | median-relative-peak-height-R1 | median-relative-peak-height-R2 | median-relative-peak-height-R3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Each row holds the median relative peak heights for one marker, one value per peak. That is, take the median peak height at each peak position across your reference samples for that marker, then compute the relative peak height as (peak-height) / (sum of peak-heights).

I hope this answers some of your questions. If anything else is unclear, I'll make sure to answer as soon as possible and add it to the readme.

indapa commented 4 years ago

Dear Michael and Alexej

Thank you for your response. The readme is very helpful and I am getting a better understanding of the input format.

I have a data set of 6 samples that were run on capillary electrophoresis. 3 of them are reference (wildtype) and 3 others have a cMS frameshift (based on cell line data). I want to use ReFrame to confirm the presence of the frameshifts.

I'm still confused about how to get the relative peak heights to populate reflist.xlsx. Is the sum of peak heights the sum over L1-4, MP, and R1-4 for each reference sample?

relative peak height = peak-height / sum of peak-heights

Then, for the 3 samples that have a cMS frameshift, I would just enter the peak heights for L1-4, MP, and R1-4.

I was not able to get GeneMapper to work on my machine, but the equivalent ThermoFisher cloud application MSA is able to export peak heights. If you can confirm my understanding of how to compute relative peak heights, then I should be able to prepare the input files for my own data.

Again, thank you so much for your responsive help and updated Readme on your git repo for ReFrame.

Sincerely, Amit

On Thu, Jul 16, 2020 at 3:39 AM Michael Jendrusch notifications@github.com wrote:

Long story short - you need two types of files to apply ReFrame to your own data:

  • a reference peak-height file for normalisation reflist.xlsx.
  • your actual input xlsx files in the in directory.

mjendrusch commented 4 years ago

Right, so for the reflist, you take your reference samples (non-mutated) and for each marker, you then take the median of peak heights for all reference samples for that marker. Then, you should have median peak heights L1-4 MP R1-4 for each marker. To get the relative peak heights, you sum over these peak heights for each marker and divide the peak heights by the sum. So now, for each marker, you should have relative peak heights L1-4 MP R1-4 which sum to 1.
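
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reference_row`, the peak labels (taken from the table header, L4 to R3), and the example heights are hypothetical and are not ReFrame code.

```python
from statistics import median

# Peak labels as in the reflist.xlsx header (assumed: L4 ... R3).
PEAKS = ["L4", "L3", "L2", "L1", "MP", "R1", "R2", "R3"]


def reference_row(reference_heights):
    """Compute median relative peak heights for one marker.

    reference_heights: list of dicts, one per non-mutated reference sample,
                       mapping each peak label to its raw peak height.
    """
    # Step 1: median of each peak across all reference samples.
    medians = {
        peak: median(sample[peak] for sample in reference_heights)
        for peak in PEAKS
    }
    # Step 2: divide each median by the sum of medians so the row sums to 1.
    total = sum(medians.values())
    if total == 0:
        raise ValueError("all median peak heights are zero")
    return {peak: medians[peak] / total for peak in PEAKS}


# Made-up heights for three reference samples of a single marker.
example_reference = [
    {"L4": 0, "L3": 0, "L2": 10, "L1": 100, "MP": 1000, "R1": 50, "R2": 0, "R3": 0},
    {"L4": 0, "L3": 0, "L2": 20, "L1": 120, "MP": 900, "R1": 60, "R2": 0, "R3": 0},
    {"L4": 0, "L3": 0, "L2": 30, "L1": 110, "MP": 1100, "R1": 40, "R2": 0, "R3": 0},
]
example_row = reference_row(example_reference)
```

In this example the medians are 20, 110, 1000 and 50 for L2, L1, MP and R1, so the relative MP height is 1000 / 1180. Note that the median is taken first and the normalisation second, matching the order described in the comment above.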

I'll update the README with a better description of how to set up the reflist.xlsx.