Kawue / provim

ProViM is a preprocessing pipeline for Mass Spectrometry Imaging (MSI) data.
GNU General Public License v3.0

Recommend parameters for FT-ICR data? #5

Closed PeifengJi closed 3 years ago

PeifengJi commented 3 years ago

Hi provim developer: First of all, thanks for developing this awesome tool, it's very useful. I'm planning to use this software to analyze my MALDI-MSI data sets generated by FT-ICR, which features extremely high resolution. Is there any suggestion for setting the parameters, like resolution, max shift, etc.?

Thank you!

Peifeng

Kawue commented 3 years ago

Hi Peifeng,

glad that the tool is useful for you. I am not familiar with FT-ICR, but I assume that our default parameters for the Orbitrap should work, as it also has a high spectral resolution.

I also just realized that I have not offered an option to change these parameters. Currently I have no time to fix this, but you can download the code and add your custom parameters to workflow_pybasis.py below line 55, similar to the tof and orbitrap blocks.

If your data has a considerably higher resolution than Orbitrap data, I would propose to reduce cmzbinsize; you could start with 0.0001, i.e. one tenth of the default. This parameter decides the granularity of the discretization of your spectral data. You could then either leave mzmaxshift as it is or also reduce it by one tenth, to 0.01. In general, the larger the difference between cmzbinsize and mzmaxshift, the more likely peaks are aligned wrongly. Especially with very high spectral resolution data you want to keep mzmaxshift small.
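As a rough sketch, such a custom parameter block could look like the following. Only the parameter names and values come from this thread; the dict structure is an assumption for illustration, not the actual workflow_pybasis.py code.

```python
# Hypothetical FT-ICR parameter block, modeled on the tof/orbitrap
# blocks mentioned above. The surrounding structure is an assumption.
fticr_params = {
    "cmzbinsize": 0.0001,  # one tenth of the orbitrap default (0.001)
    "mzmaxshift": 0.01,    # keep small for very high spectral resolution
}
```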

The remaining parameters can stay at their defaults, although you might want to set intranorm_offset to ignore noisy signals, as their "alignment" could drastically slow down the method with higher-resolution data. However, I cannot give you a value for that, as it depends on your data set. Looking at the mean spectrum in your vendor software could help with that.
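If vendor software is not at hand, a quick mean-spectrum check is easy to do by hand. This is a heuristic sketch, not part of ProViM or pyBASIS: it assumes a pixels x m/z intensity matrix, and the function name is made up.

```python
import numpy as np

# Sketch: compute the mean spectrum over all pixels and take a low
# percentile of it as a rough noise-floor candidate for intranorm_offset.
def estimate_noise_floor(intensity_matrix, percentile=25):
    mean_spec = np.asarray(intensity_matrix, dtype=float).mean(axis=0)
    return np.percentile(mean_spec, percentile)
```

Any value picked this way should still be sanity-checked against the actual spectrum.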

Sadly I have not implemented a method to statistically analyze the alignment result. So you have to evaluate that by yourself.

I also wrote to a former colleague of mine from the department of biology and asked for her opinion. I will share her answer as soon as she replies.

Best regards, Kawue

PeifengJi commented 3 years ago

@Kawue Thanks for the quick reply, it is very useful. I tried reducing cmzbinsize to 0.001 on one imzML file for testing. Unfortunately, this took so much memory that I had to kill the process, which was stopped at the get_reference step (in the import_imzml module of the pyBASIS package). I checked the number of m/z values in the file (raw data) and it turns out to be 1,441,517. I think this number is too big, and the large majority of these m/z values are likely noise. So, should I first reduce the number of m/z values by setting a cutoff based on intensity (which ranges from 1e5 to 5e8)? Would that have any impact on the downstream analysis? Thank you!

Best regards,

Peifeng

Kawue commented 3 years ago

Ok, I have multiple questions.

  1. You write you reduced it to 0.001. Is that correct, or did you miss a 0? 0.001 is already the default of our Orbitrap setting.
  2. What are the intensity ranges you described? 1e5 seems quite high for a lower boundary; or did you mean the lowest reasonable peak to pick?
  3. I think setting a cutoff is a reasonably good choice which should reduce the memory and especially the computation time. However, I never used this setting for my own data, and I have not worked with the tool for about one year, as I left university. I just looked into the code (this step was written by the pyBASIS team), and it looks like values below the cutoff are simply set to 0. Of course this can impact the downstream analysis: if you choose the cutoff too high, you will remove low-intensity characteristic signals; if it is too low, you might only eliminate a part of the noise. I am not sure, but I assume that the remaining noise signals could then appear to look like normal signals, which is the worst case in my opinion.
  4. Do you use the tool for an individual file or multiple at the same time? So do you need only the intra data normalization or also inter data normalization?
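A minimal sketch of the cutoff behavior described in point 3, mirroring what the pyBASIS code reportedly does (the function name is mine, not from either code base):

```python
import numpy as np

# Intensities below the cutoff are simply zeroed out; everything else
# passes through unchanged.
def apply_intensity_cutoff(intensities, cutoff):
    arr = np.asarray(intensities, dtype=float)
    return np.where(arr < cutoff, 0.0, arr)
```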

You have to keep in mind that all data transformations impact the data, and we never know for sure whether we "correct" something towards its original state or introduce wrong information with our transformations. That applies to basically all methods that influence the data directly. It is totally fine to do so, as we have to transform the data at some point; the measurements are not correct either, otherwise there would be no need for alignment. So when you are really unsure about some effect, the only way is to use multiple variants of transformations, analyze their effect on the downstream results, and hopefully identify the "correct" approach, or let's say the approach that keeps the main characteristics of your data.

PeifengJi commented 3 years ago

To answer the questions:

  1. Yes, it is 0.001. I found that memory usage increased dramatically when reducing cmzbinsize. I have to test whether my computer can analyze the data.
  2. I just plotted the raw data, with m/z on the x axis and the corresponding intensity on the y axis. You can see from the plot:

[attached image: plot of the raw spectrum, m/z vs. intensity]

  3. I looked into the pyBASIS code and found that they build a reference m/z axis using all the m/z values deposited in the file (1,441,517 in my case). So, in order to reduce the computational burden, I have to either set a cutoff to reduce the m/z list size or subsample the m/z list. I prefer the latter. Unfortunately, after down-sampling the m/z list to 10,000 entries, the get_reference function showed no change (it still takes a huge amount of memory). So this function seems to depend mainly on the value of cmzbinsize.

  4. I have multiple files in my data sets, but am just testing the software with an individual imzML file.

Totally agree with your comments. I'm used to analyzing DNA/RNA sequencing data, which is just composed of A/T/C/G, and am new to metabolomics, which is much more complex. Again, thank you very much!
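A back-of-the-envelope check of why memory grows with a finer cmzbinsize rather than with the raw m/z list size: the common m/z axis has roughly (mz_max - mz_min) / cmzbinsize bins, so each tenfold reduction of the bin size multiplies the axis length by ten. The mass range below is an illustrative assumption, not from the data set in this thread.

```python
# Illustrative mass range; real FT-ICR ranges depend on the experiment.
mz_min, mz_max = 100.0, 1000.0

for binsize in (0.01, 0.001, 0.0001):
    n_bins = int(round((mz_max - mz_min) / binsize))
    print(f"cmzbinsize={binsize}: ~{n_bins:,} bins on the common m/z axis")
```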

PeifengJi commented 3 years ago

@Kawue Anyway, I have tried the whole pipeline provided and there is no issue with memory usage (strange!). I'll move on to my data sets. Thank you! BTW, unlike transcriptome analysis, there are few software tools available that focus on multiple-MSI data analysis. This tool is indeed helpful.

Kawue commented 3 years ago

Yeah, the rarity of tools is a major problem; that's why we adapted pyBASIS for our approach. Sadly our pipeline is far less polished than I would like it to be. However, if you have no cluster available for the memory issue in the future, I would recommend thresholding instead of subsampling. In the end, the first pipeline step is all about aligning, and an m/z value sampled from your raw data in one pixel might have no intensity in another pixel, because they are not aligned yet. Also, I have to say that the alignment procedure of pyBASIS is a super simple one. On the other hand, this makes it easier to understand.
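A toy illustration of that point (the numbers are made up): before alignment, an m/z value sampled from one pixel's raw list need not occur in another pixel's list at all, whereas thresholding keeps each pixel's own peak positions intact.

```python
import numpy as np

# Two pixels measuring the same two peaks with a small pre-alignment shift.
pixel_a_mz = np.array([100.0001, 200.0003])
pixel_b_mz = np.array([100.0004, 200.0007])

# Subsampling the m/z list of pixel A...
sampled = pixel_a_mz[:1]
# ...yields values that simply do not exist in pixel B's raw list.
missing_in_b = ~np.isin(sampled, pixel_b_mz)
```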

Be aware that inter alignment can lead to strange results, in the sense of misalignments. pyBASIS does not care about your data and will blindly align whatever you hand over. If the measured samples are similar, i.e. have a similar molecular character, this should be fine. But keep it in mind if you encounter strange results at some point. As I said, we have no statistical evaluation tool available here.

If you also consider using one of my other tools for analysis in the future, it would be interesting to get feedback on whether they were helpful.

FYI: it might happen that I answer late. As I said, I am not at university anymore, but I try to maintain the tools and help as best I can.