caporaso-lab / student-microbiome-project

Central repository for data and analysis tools for the Student Microbiome Project.

Trajectory clustering #14

Closed ElDeveloper closed 11 years ago

ElDeveloper commented 11 years ago

Assess similarities between different time trajectories across body sites, or even across individuals.

The idea here is to treat each trajectory as a signal, where the x-axis is time and the y/z-axes are PC1/PC2. The same procedure could be used at the OTU level, with time and abundance.

To address this issue, the interpolation problem may need to be solved first, so that the signals are denser (see the sketch just below).
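
A minimal sketch of that densification step (not project code), assuming one subject's PC1 trajectory is held in NumPy arrays and using SciPy's generic `interp1d`; the day and coordinate values are made up:

```python
import numpy as np
from scipy.interpolate import interp1d

# Observed (unevenly spaced) sampling days and PC1 coordinates for one
# subject/body site; the values here are purely illustrative.
days = np.array([0.0, 3.0, 7.0, 14.0, 21.0, 35.0])
pc1 = np.array([0.12, 0.08, -0.02, 0.05, 0.11, 0.03])

# Resample onto a uniform daily grid so trajectories from different
# body sites or individuals can be compared point by point.
to_dense = interp1d(days, pc1, kind="linear")
dense_days = np.arange(days.min(), days.max() + 1)
dense_pc1 = to_dense(dense_days)
```

The same call with `kind="quadratic"` or `kind="cubic"` would swap in a smoother interpolant.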

antgonza commented 11 years ago

For the interpolation we have two options, or, better said, we can work on two levels:

ElDeveloper commented 11 years ago

Also, Alex Washburn (a recent lab visitor) is working on models to interpolate at the OTU level, which in the end would allow us to interpolate at the community level. The only "issues" he mentioned with using this are that all the code is currently written in MATLAB and that it still needs some work.

I will ask Will/Justin about the code as I don't know about it.

rob-knight commented 11 years ago

Great idea; this could be a good dataset to point him at, to see whether the techniques are valuable before recoding/integrating?


gregcaporaso commented 11 years ago

Thanks for getting this started, Yoshiki. I'm going to assign you to this issue.

ElDeveloper commented 11 years ago

The beta-diversity plots were added in 31fcc30863adba3f16c9caab5388a55299870e67

ElDeveloper commented 11 years ago

Interpolation

I have been working first on solving the interpolation/resampling issue described above. This will make this dataset compatible with the frequency-clustering method, and will hopefully produce a script that I have seen a couple of users show interest in.

I have opted for OTU-level resampling. The general steps I have sketched are very similar to what is done with multiple rarefactions (a rough code sketch follows below):

  1. Begin with a non-rarefied OTU table and select a rarefaction depth and a number of iterations.
  2. Specify the final sampling period† you want your mapping file and OTU table to have.
  3. Sort the samples along the time gradient and insert/remove the samples needed to obtain the "final sampling period". This step should be performed on both the mapping file and the OTU table.
  4. For each iteration:
    • Generate a rarefied OTU table.
    • For each OTU in this rarefied OTU table:
      • Use a linear/quadratic†† interpolation method to fill in the gaps that were added in step 3.
    • Drop all the samples from the table that are not needed.
    • Save this interpolated table.
  5. Compute the mean of all the OTU tables saved in step 4.

This processing is very similar to what others do with microarray data, though the "multiple rarefactions" step is not something I have seen an equivalent of in microarray data analysis.

I would really appreciate it if people could make some suggestions regarding this. I have part of the work done, though I still have a good way to go (specifically, the add/remove samples step turned out to be somewhat problematic).
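
Here is a rough sketch of steps 1–5, not the eventual script: it assumes the OTU counts live in a pandas DataFrame with OTUs as rows and samples as columns already sorted by collection day, both helper names (`rarefy`, `interpolated_mean_table`) are hypothetical, and only linear interpolation is shown.

```python
import numpy as np
import pandas as pd

def rarefy(counts, depth, rng):
    """Subsample each sample (column) to `depth` sequences without replacement."""
    rarefied = counts.copy()
    for sample in counts.columns:
        col = counts[sample].to_numpy()
        # Expand counts into individual sequence labels, then draw `depth` of them.
        pool = np.repeat(np.arange(len(col)), col)
        picked = rng.choice(pool, size=depth, replace=False)
        rarefied[sample] = np.bincount(picked, minlength=len(col))
    return rarefied

def interpolated_mean_table(counts, collection_days, target_days,
                            depth=1000, iterations=10, seed=0):
    """Steps 1-5: average several rarefied, linearly interpolated OTU tables."""
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(iterations):
        rarefied = rarefy(counts, depth, rng)                      # step 4, rarefy
        # Steps 3/4: evaluate each OTU's abundance on the evenly spaced grid
        # implied by the chosen final sampling period.
        dense = pd.DataFrame(
            [np.interp(target_days, collection_days, rarefied.loc[otu])
             for otu in rarefied.index],
            index=rarefied.index, columns=target_days)
        tables.append(dense)                                       # step 4, save
    return sum(tables) / len(tables)                               # step 5, mean
```

For example, `interpolated_mean_table(otu_table, days_per_sample, np.arange(0, 36, 3))` would target a three-day sampling period (both input names are placeholders).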

Clustering

I will try to summarize the general steps of this clustering procedure; if you would like a more in-depth explanation of the algorithm, please see section II of this file (http://cl.ly/3J3m1B1Z0a2T).

Take as input a rarefied OTU table that is evenly spaced in time.

The output is a list where each element is a group of OTUs that turned out to be very similar according to their frequency characteristics. This means you can find time-lagged related OTUs in the same group, as well as non-time-lagged related OTUs.

The same algorithm can be used with PCoA coordinates instead of OTU tables, though the general limitation is even sampling, hence the Interpolation section above.
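
Purely as an illustration of the grouping idea (this is not the algorithm from the linked write-up), one could describe each OTU's evenly spaced series by its power spectrum, which discards phase and therefore time lags, and cluster OTUs with similar spectra; `dense_table` is assumed to be an interpolated OTU table like the one sketched above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_by_spectrum(dense_table, n_clusters=5):
    """Group OTUs (rows) whose abundance series have similar frequency content."""
    series = dense_table.to_numpy(dtype=float)
    # Power spectrum of each trajectory; dropping the phase makes the
    # representation insensitive to time lags between OTUs.
    spectra = np.abs(np.fft.rfft(series, axis=1)) ** 2
    # Normalize so clustering reflects spectral shape rather than total abundance.
    spectra /= spectra.sum(axis=1, keepdims=True) + 1e-12
    labels = fcluster(linkage(spectra, method="average"),
                      n_clusters, criterion="maxclust")
    # One list of OTU identifiers per cluster.
    return [list(dense_table.index[labels == k]) for k in np.unique(labels)]
```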


† Sampling period: time between samples in the mapping file.

†† Ideally, the final algorithm will not be limited to a specific interpolation method and should be capable of using distinct methods.

antgonza commented 11 years ago

+1

rob-knight commented 11 years ago

This looks great -- thanks, Yoshiki!

Rob


gregcaporaso commented 11 years ago

Yes, this sounds really interesting. What is the issue that you're having with adding/removing samples?

ElDeveloper commented 11 years ago

One of the key problems that I didn't consider at first glance, and that @antgonza pointed out, is that we currently don't know what an ideal sampling period for this type of data is. This matters a lot, because the result of this data imputation procedure (as it stands) would be very biased and, at the end of the day, not really helpful. Another thing that was not being considered (and would also be really helpful) was taking similar treatments and their development over time into account, so that missing points could be imputed for subjects within the same treatment; I think this would be an awesome thing to integrate.

Additionally, the previously outlined steps have changed, and this no longer seems like something that will be ready in time, as I know the intention is to start writing up the manuscript sometime soon.

Nonetheless, we also agreed that this dataset could be useful for testing the method once it is in place, which is outside the scope of this paper.

I will still proceed with the clustering, although not with all the subjects.

gregcaporaso commented 11 years ago

Could you subsample from the Moving Pictures data set to identify a useful sampling period (even if it's not ideal)?
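
One possible way to act on this, purely as a sketch with made-up names: keep every k-th sample from a densely sampled Moving Pictures subject, reconstruct the hidden points by interpolation, and watch how the error grows with the candidate period (`dense_days`/`dense_pc1` would be NumPy arrays for one dense trajectory):

```python
import numpy as np

def reconstruction_error(days, values, period):
    """Mean absolute error after keeping every `period`-th sample and interpolating."""
    kept = np.arange(0, len(days), period)
    estimate = np.interp(days, days[kept], values[kept])
    return np.mean(np.abs(estimate - values))

# Example: compare candidate sampling periods of 2-7 days.
# for period in range(2, 8):
#     print(period, reconstruction_error(dense_days, dense_pc1, period))
```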

ElDeveloper commented 11 years ago

Yes, that does sound like a good idea; this was also pointed out by @antgonza.

ElDeveloper commented 11 years ago

I've added a couple of files to Issue_14. @antgonza and I are going to work on the Google Doc, and hopefully we will have it ready by tomorrow.

gregcaporaso commented 11 years ago

Thanks! Really excited to work on this.

ElDeveloper commented 11 years ago

Sorry, we sent out this document to the bioinfo list but I'm not sure if everyone in this project gets that info.

An explanation of the volatility analysis can be found here.

gregcaporaso commented 11 years ago

@ElDeveloper has not been getting interesting results with the approaches described here, and is going to focus on #31.