biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License

songbird multinomial treating variables as categorical when they are numeric #151

Closed bck243 closed 3 years ago

bck243 commented 3 years ago

Hello!

I was wondering how to explicitly specify that my input variables are numeric (continuous) rather than categorical?

In the examples, songbird multinomial treats variables as numeric by default (e.g. with --formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate"), but when I run it on my data, it defaults to categorical and creates a column for each possible value of the variable.

For example, when I run qiime songbird multinomial on pressure (the CTDPRS column):

#!/usr/bin/env python
import os
os.system('source activate qiime2-2020.6 && qiime songbird multinomial \
  --i-table ./data/P18_16S_all_runs_raw_no_contam_or_control2.qza \
  --m-metadata-file ./data/iTAG_metadata_16S_all_for_q2_8_2020.tsv \
  --p-formula "CTDPRS" \
  --p-epochs 10000 \
  --p-differential-prior 0.5 \
  --p-summary-interval 1 \
  --o-differentials CTDPRS_differentials.qza \
  --o-regression-stats CTDPRS_regression-stats.qza \
  --o-regression-biplot CTDPRS_regression-biplot.qza')

I get this resulting table:

> head -n 3 CTDPRS_differentials.tsv
featureid   Intercept   CTDPRS[T.1025.4]    CTDPRS[T.1029]  CTDPRS[T.1029.3]    CTDPRS[T.1050]  CTDPRS[T.1070.6]    CTDPRS[T.1075.9]    CTDPRS[T.1089.7]    CTDPRS[T.1122.6]    CTDPRS[T.1129.5]    CTDPRS[T.1129.6]    CTDPRS[T.1130.1]    CTDPRS[T.1169.9]    CTDPRS[T.1174.1]    CTDPRS[T.120.4] CTDPRS[T.120.5] CTDPRS[T.1217.9]    CTDPRS[T.1218.1]    CTDPRS[T.1222.1]    CTDPRS[T.1222.7]    CTDPRS[T.1236]  CTDPRS[T.125.6] CTDPRS[T.1256]  CTDPRS[T.1269.7]    CTDPRS[T.1298.3]    CTDPRS[T.1306.2]    CTDPRS[T.1311]  CTDPRS[T.1332.9]    CTDPRS[T.1334.6]    CTDPRS[T.1366.9]    CTDPRS[T.1370]  CTDPRS[T.1420.2]    CTDPRS[T.1420.9]    CTDPRS[T.144.3] CTDPRS[T.1443.2]    CTDPRS[T.145.5] CTDPRS[T.150.9] CTDPRS[T.1501.8]    CTDPRS[T.1525.2]    CTDPRS[T.1547.2]    CTDPRS[T.155.2] CTDPRS[T.155.5] CTDPRS[T.1550.6]    CTDPRS[T.1555.2]    CTDPRS[T.1565.1]    CTDPRS[T.1569.2]    CTDPRS[T.160]   CTDPRS[T.1617.9]    CTDPRS[T.1630]  CTDPRS[T.164.7] CTDPRS[T.1654.7]    CTDPRS[T.1659.9]    CTDPRS[T.1690.1]    CTDPRS[T.1696.9]    CTDPRS[T.170.7] CTDPRS[T.171.1] CTDPRS[T.1749.7]    CTDPRS[T.1752.1]    CTDPRS[T.1758.5]    CTDPRS[T.1760.7]    CTDPRS[T.177.7] CTDPRS[T.18.6]  CTDPRS[T.180.5] CTDPRS[T.1840.2]    CTDPRS[T.1845.9]    CTDPRS[T.1851.1]    CTDPRS[T.190.9] CTDPRS[T.192.5] CTDPRS[T.1920.2]    CTDPRS[T.194.7] CTDPRS[T.197.8] CTDPRS[T.20.1]  CTDPRS[T.200.6] CTDPRS[T.2001.7]    CTDPRS[T.201]   CTDPRS[T.201.6] CTDPRS[T.2030.2]    CTDPRS[T.2049.9]    CTDPRS[T.2053.6]    CTDPRS[T.210.1] CTDPRS[T.2100.5]    CTDPRS[T.2130.5]    CTDPRS[T.2219.9]    CTDPRS[T.2260.8]    CTDPRS[T.2359.1]    CTDPRS[T.2360.1]    CTDPRS[T.2360.2]    CTDPRS[T.2364.7]    CTDPRS[T.2379.5]    CTDPRS[T.24.8]  CTDPRS[T.2461.8]    CTDPRS[T.2496.2]    CTDPRS[T.25.5]  CTDPRS[T.25.6]  CTDPRS[T.25.8]  CTDPRS[T.250.3] CTDPRS[T.255.2] CTDPRS[T.2560.2]    CTDPRS[T.2580.5]    CTDPRS[T.2589.4]    CTDPRS[T.2591.7]    CTDPRS[T.26.1]  CTDPRS[T.265.4] CTDPRS[T.2685.2]    CTDPRS[T.2774.5]    CTDPRS[T.2794.7]    CTDPRS[T.283.2] CTDPRS[T.284.2] 
CTDPRS[T.284.9] CTDPRS[T.2864.1]    CTDPRS[T.29.9]  CTDPRS[T.2900.6]    CTDPRS[T.2920.6]    CTDPRS[T.2930.2]    CTDPRS[T.2976.4]    CTDPRS[T.3.2]   CTDPRS[T.3.3]   CTDPRS[T.3.7]   CTDPRS[T.30]    CTDPRS[T.30.3]  CTDPRS[T.30.9]  CTDPRS[T.3006.2]    CTDPRS[T.3023.1]    CTDPRS[T.304.9] CTDPRS[T.3049.5]    CTDPRS[T.3049.7]    CTDPRS[T.3051]  CTDPRS[T.3065.7]    CTDPRS[T.3077.1]    CTDPRS[T.31.4]  CTDPRS[T.31.6]  CTDPRS[T.3120.1]    CTDPRS[T.3140.7]    CTDPRS[T.3141.4]    CTDPRS[T.315.5] CTDPRS[T.3159.3]    CTDPRS[T.3161]  CTDPRS[T.317.6] CTDPRS[T.3175.2]    CTDPRS[T.3225.4]    CTDPRS[T.3280.7]    CTDPRS[T.3284.4]    CTDPRS[T.3290.1]    CTDPRS[T.3300.6]    CTDPRS[T.331.2] CTDPRS[T.3323.4]    CTDPRS[T.340.4] CTDPRS[T.3445]  CTDPRS[T.345.1] CTDPRS[T.3451.3]    CTDPRS[T.3456.1]    CTDPRS[T.3461.9]    CTDPRS[T.355.2] CTDPRS[T.3550.9]    CTDPRS[T.3580.5]    CTDPRS[T.3613.4]    CTDPRS[T.364.8] CTDPRS[T.3654.3]    CTDPRS[T.3657.1]    CTDPRS[T.3680.9]    CTDPRS[T.37.9]  CTDPRS[T.3730.3]    CTDPRS[T.3751]  CTDPRS[T.376.5] CTDPRS[T.376.6] CTDPRS[T.3781.7]    CTDPRS[T.380]   CTDPRS[T.3805.3]    CTDPRS[T.3846.7]    CTDPRS[T.3933.4]    CTDPRS[T.394.5] CTDPRS[T.397.9] CTDPRS[T.4.4]   CTDPRS[T.40]    CTDPRS[T.405.3] CTDPRS[T.4116.7]    CTDPRS[T.4232.8]    CTDPRS[T.427.7] CTDPRS[T.4375.5]    CTDPRS[T.4380.4]    CTDPRS[T.4425.5]    CTDPRS[T.4449.4]    CTDPRS[T.4490.2]    CTDPRS[T.455.1] CTDPRS[T.455.6] CTDPRS[T.485.4] CTDPRS[T.488.6] CTDPRS[T.50]    CTDPRS[T.501.2] CTDPRS[T.509.8] CTDPRS[T.5093.6]    CTDPRS[T.516.7] CTDPRS[T.52.2]  CTDPRS[T.520.2] CTDPRS[T.525.5] CTDPRS[T.526.4] CTDPRS[T.5365.5]    CTDPRS[T.539.6] CTDPRS[T.550.1] CTDPRS[T.555.7] CTDPRS[T.574.2] CTDPRS[T.581.9] CTDPRS[T.59.4]  CTDPRS[T.590.9] CTDPRS[T.598.6] CTDPRS[T.6.5]   CTDPRS[T.60.5]  CTDPRS[T.60.7]  CTDPRS[T.600.3] CTDPRS[T.609.7] CTDPRS[T.636.2] CTDPRS[T.644.8] CTDPRS[T.685.4] CTDPRS[T.70.3]  CTDPRS[T.71CTDPRS[T.74.5]   CTDPRS[T.749.4] CTDPRS[T.752]   CTDPRS[T.755.2] CTDPRS[T.760.2] CTDPRS[T.767.6] CTDPRS[T.770]   
CTDPRS[T.771.6] CTDPRS[T.798.5] CTDPRS[T.8.6]   CTDPRS[T.811.9] CTDPRS[T.817.1] CTDPRS[T.849.8] CTDPRS[T.852.2] CTDPRS[T.854.6] CTDPRS[T.890.1] CTDPRS[T.90.3]  CTDPRS[T.911.5] CTDPRS[T.926.5] CTDPRS[T.936]   CTDPRS[T.939.5] CTDPRS[T.945.4] CTDPRS[T.95.3]  CTDPRS[T.952.7] CTDPRS[T.969.9] CTDPRS[T.970.4] CTDPRS[T.98.7]  CTDPRS[T.990.1] CTDPRS[T.NA]
#q2:types   numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric
d6e46e7c78ebf4a2ad1ae06d81f62e99    6.26266429405215    0.721246349541283   -0.118528554841875  -0.460868573600918  0.911035084665726   -0.0140286240174749 0.505563908359061   0.454220671828338   -0.0671327621721176 0.454408171609246   0.813562496177106   0.812192168960679   0.29387666395741    -0.49145616919261   0.756639211506886   0.400264747247543   0.89830300990085    0.576096496140973   0.904000273891397   0.481988024093166   0.423911890081252   0.211347157957577   0.340499888435965   -0.455301582168929  -0.0216667187160635 -0.0667211243715614 -0.310858039838213  -0.141031128984692  0.419671310403407   -0.125851309407888  0.533843180640886   1.44306233579608    0.61380134105976    -0.370699980420485  0.485338748080432   0.12946401432241    0.868977021472778   0.0318036366049193  -0.484231661527177  -0.385083751201681  -0.244545030427242  -0.372789685462782  -0.175431989035118  -0.0744335075343431 0.54171604144651    -0.428674718921024  -0.443278122164045  0.47891172599517    -0.475582427957821  0.0126895090679926  0.0299326262845326  -0.330626281430452  0.670509850112057   0.147198354619653   0.0922213981417316  0.0466869645774893  0.819851287433506   0.436057368259834   -0.245779981447492  0.0171808366630144  0.411049418456969   0.0418374119216923  -0.229990287394701  0.348378426218603   0.774511674894767   -0.469607651476893  -0.177630900223319  0.776445508645642   0.637532218655303   0.536614997416455   -0.137922958163653  -0.40582224508243   0.103463358955136   -0.345126768583297  0.00622191106861172 0.479556828976786   0.107129521942362   0.949419039309769   0.30821354410195    0.780551038704411   -0.230374331264756  -0.453741172287351  0.347376642820596   0.397590275618087   0.716277733179224   0.0105333945384332  -0.0204050211953775 0.798268133005421   0.303208699820754   0.916840246258884   0.932020593111161   0.869009585193216   -0.366889562739929  0.274896600352142   -0.251664071603055  0.541317082996363   -0.0238103350514816 -0.430567021494016  
0.825781527524871   -0.127025004164301  0.542411996139056   0.471988448763916   0.0128852147104096  -0.41133330147511   0.370164280452694   0.835754283837527   1.74575390838686    0.445611472894255   0.373198551285869   1.55284948769945    -0.282430625784262  0.516803109646316   -0.268438961161599  -0.230701724655778  1.17102045188814    -0.217883331520773  -0.316544936270762  -0.456458815469845  2.24620699327938    -0.296249859256854  0.021626367923998   -0.415262033258103  -0.154692328142734  -0.393669600259227  -0.107949595719998  0.674551267972083   0.619699220176403   0.681498862056877   0.368147086623759   1.59723174150272    -0.19847310137887   0.802267934644013   -0.467674750144609  1.23044241181125    0.758268617132654   0.0285586585561744  0.796062079287582   0.827040589916411   0.750896448575437   -0.473235409546303  -0.389129851983682  1.08538341462409    0.152347967604867   0.127003903761777   0.294420661137689   0.581182796840563   0.520558453745887   0.662011258471017   1.28539809629611    -0.0884698824099879 1.365640198691  1.03645707465593    0.0279246458326644  0.925357444471826   -0.214384486877157  0.924948118688353   0.052622486086653   0.618610734526831   -0.206767012982322  1.01905485598883    -0.407033893235235  0.267596827758334   0.390095526720145   -0.111120649963599  0.130970747632971   0.753015275272711   0.877954494722947   0.69549417481261    0.208288191187037   -0.0428709518097923 0.507930408362329   -0.391934887291255  -0.2788457963654    -0.285429711077457  0.701400522591834   0.159102538341071   0.760343047963215   0.805853098900264   0.803397908751945   0.962828648247584   0.767693430386164   0.401075517564869   0.880382040987884   0.205172315934692   0.459310164500681   0.151916751971889   -0.375192487097362  -0.4343964746052    0.836128602824274   0.740334990180792   0.144016031308188   0.553750430718036   0.37174103351254    1.18286092062309    -0.331098325089527  0.424122228546557   1.27674829479941    0.448080095112241   
0.740276510854393   0.546845348249068   0.521087208683315   2.02819338746913    -0.342719017319364  -0.377352778129133  0.622979314416811   0.0931121751517784  -0.21164841373085   0.295521361117233   0.724347374409162   0.115375855501859   0.0786108631637291  0.217065194540915   0.295835634120771   -0.199220988350417  0.134518506329297   -0.293681456526919  0.5845329488764 1.28840928768187    -0.569842251211884  -0.363027416170437  0.797206321847948   -0.0756975159951875 0.820719965440082   0.677849715111999   0.129837816391377   -0.0178728446452992 1.05160401744091    -0.371002635676899  1.2920883170642 0.124757881088036   0.411487031764555   -0.0276765732175986 -0.372541875859344  -0.139978684558234  -0.0630419203429542 0.730955618093975   0.0198048472348807  -0.408429370291369  0.418259480245993   0.887406168651485   0.409069381702858   0.648515745295158   0.708973067847126   -0.238274394366066

This is also happening when I have formulas with multiple numeric variables, e.g. "CTDPRS+CTDTMP+LATITUDE+consensus_age_interpolated"

Thank you!

mortonjt commented 3 years ago

interesting -- do you have any categorical variables that snuck into these columns? Sometimes if you leave values blank, or if you have a value such as none, it will default to a categorical variable. If you want, you can post your metadata file to help with debugging.
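
One way to check for stray values like this is to scan the column for cells that do not parse as a number. The snippet below is a minimal sketch using only the Python standard library; the function name, the `sample-id` header, and the sample IDs are made up for illustration, and it assumes a tab-separated metadata file whose first column holds the sample IDs.

```python
import csv
import io
import math


def non_numeric_values(tsv_text, column):
    """Return {sample_id: value} for cells in `column` that are not finite numbers."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_column = reader.fieldnames[0]  # assumption: first column holds sample IDs
    offenders = {}
    for row in reader:
        sample_id = row[id_column]
        if sample_id.startswith("#q2:"):
            continue  # skip QIIME 2 directive rows such as #q2:types
        value = (row[column] or "").strip()
        try:
            is_number = math.isfinite(float(value))
        except ValueError:
            is_number = False  # blanks and strings like "NA" or "none" land here
        if not is_number:
            offenders[sample_id] = value
    return offenders


# Hypothetical metadata: s2 has "NA" and s3 is blank in CTDPRS.
example = (
    "sample-id\tCTDPRS\n"
    "s1\t1025.4\n"
    "s2\tNA\n"
    "s3\t\n"
    "s4\t18.6\n"
)
print(non_numeric_values(example, "CTDPRS"))
```

To run it on a real file, read the file contents with `open(path).read()` and pass the text in; any sample IDs it reports are the ones worth inspecting first.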

bck243 commented 3 years ago

Thank you! Here is my metadata file. There is one "NA" in CTDPRS.

iTAG_metadata_16S_all_for_q2_8_2020.txt https://github.com/biocore/songbird/files/5946119/iTAG_metadata_16S_all_for_q2_8_2020.txt

mortonjt commented 3 years ago

ok, if you see the following samples, they have a ton of NAs. I'd try to drop all samples that don't have continuous value measurements and see if you can get something reasonable

E_18A_16S_C_AGAGTCAC E_18A_16S_G_TAGCGAGT E_18B_16S_G_CTGCGTGT E_18B_16S_C_TACGAGAC

bck243 commented 3 years ago

Ok, thanks!

More generally, is there a way to manually specify that a variable is continuous, or will I need to remove all NA's from metadata for the other variables that are causing me trouble? For example, I have more "NAs" in "consensus_age" than "CTDPRS".

I fixed the metadata to not have NA's for CTDPRS for those samples and am re-running qiime songbird multinomial. I'm expecting it to take a couple days, because it did last time.

I also tried subsetting my data to see if it works more quickly, but I come up against another error when I try that:

Get small subset of 16S samples:

source activate qiime2-2020.6
qiime feature-table filter-samples \
 --i-table P18_16S_all_runs_raw_no_contam_or_control2.qza \
 --m-metadata-file example_samples.txt \
 --o-filtered-table example_samples.qza

Run songbird multinomial test with new metadata:

qiime songbird multinomial \
  --i-table ./data/example_samples.qza \
  --m-metadata-file ./data/iTAG_metadata_16S_all_for_q2_2_2021.tsv \
  --p-formula "CTDPRS" \
  --p-epochs 10000 \
  --p-differential-prior 0.5 \
  --p-summary-interval 1 \
  --o-differentials CTDPRS_example_samples.qza \
  --o-regression-stats CTDPRS_example_samples_regression-stats.qza \
  --o-regression-biplot CTDPRS_example_samples_regression-biplot.qza

Error:

Plugin error from songbird:

  initial_value must have a shape specified: Tensor("random_normal:0", shape=(6, ?), dtype=float32)

Debug info has been saved to /usr/local/scratch/path/tmp/qiime2-q2cli-err-gepubej9.log

Check that subsetting didn't go wrong:

qiime tools export \
  --input-path example_samples.qza \
  --output-path example_sample_subset
biom convert -i feature-table.biom -o feature-table.tsv --to-tsv

Subsetted data looks fine:

> head -n 3 feature-table.tsv 
# Constructed from biom file
#OTU ID A_101S_16S_G_ACTATCTG   B_10S_16S_G_GACACCGT    C_100S_16S_G_ACTATCTG   C_105S_16S_G_CTGCGTGT   C_1S_16S_G_CTGCGTGT D_103S_16S_G_ACTATCTG   D_104S_16S_G_CTGCGTGT
a0381498f3581ed0249c8a1cd28b6e3b    0.0 0.0 0.0 21.0    0.0 0.0 0.0

mortonjt commented 3 years ago

right, you'll need to drop those NA values in order for the variable to be continuous (since a column containing NA will be treated as a categorical variable).

we could probably handle it as missing data at some point, but that'll require a bit of thought on the underlying model.
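
Until something like that exists, the metadata can be filtered before running songbird. The following is a rough sketch, not part of songbird or QIIME 2: it keeps only samples whose values parse as finite numbers in every column used in the formula. The function names and the example rows are hypothetical, and it assumes a tab-separated file with sample IDs in the first column.

```python
import csv
import io
import math


def is_number(text):
    """True if `text` parses as a finite float."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


def drop_non_numeric_rows(tsv_text, columns):
    """Return TSV text without samples that lack a numeric value in any of `columns`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    indexes = [header.index(name) for name in columns]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in body:
        # Keep QIIME 2 directive rows (e.g. #q2:types) untouched.
        is_directive = bool(row) and row[0].startswith("#q2:")
        all_numeric = all(
            i < len(row) and is_number(row[i].strip()) for i in indexes
        )
        if is_directive or all_numeric:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical metadata: s2 and s3 each have an "NA" in one formula column.
example = (
    "sample-id\tCTDPRS\tconsensus_age\n"
    "s1\t1025.4\t12\n"
    "s2\tNA\t15\n"
    "s3\t18.6\tNA\n"
    "s4\t20.1\t30\n"
)
print(drop_non_numeric_rows(example, ["CTDPRS", "consensus_age"]))
```

Filtering on only the columns named in the formula avoids discarding samples because of NAs in columns the model never sees.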

bck243 commented 3 years ago

Working after removing all NA's, thanks!

thermokarst commented 3 years ago

Hi @bck243, as @mortonjt mentioned, the NAs are the problem here. You don't need to drop the entire sample from your metadata file, though; simply remove the NA value from the cell. Please see here for more details on the QIIME 2 metadata spec:

https://docs.qiime2.org/2021.4/tutorials/metadata/
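
For a large metadata file, emptying the NA cells can be scripted. This is a minimal sketch under a few assumptions: the file is tab-separated, the first column holds sample IDs, and "NA" is the only placeholder in use. The function name and example rows are hypothetical. It is worth confirming on a small run that songbird then fits the column as continuous.

```python
import csv
import io


def blank_placeholders(tsv_text, placeholders=("NA",)):
    """Return TSV text with placeholder cells (e.g. "NA") replaced by empty cells."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for index, row in enumerate(rows):
        # Leave the header, blank lines, and #q2: directive rows unchanged.
        if index == 0 or not row or row[0].startswith("#q2:"):
            writer.writerow(row)
            continue
        cleaned = ["" if cell.strip() in placeholders else cell for cell in row[1:]]
        writer.writerow([row[0]] + cleaned)  # never touch the sample ID itself
    return out.getvalue()


# Hypothetical metadata with one "NA" per sample.
example = (
    "sample-id\tCTDPRS\tconsensus_age\n"
    "s1\t1025.4\tNA\n"
    "s2\tNA\t15\n"
)
print(repr(blank_placeholders(example)))
```

Unlike dropping rows, this keeps every sample in the file and only clears the offending cells.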