EBI-Metagenomics / MGnifyR

R package for searching, downloading and analysis of EBI MGnify metagenomics data
https://ebi-metagenomics.github.io/MGnifyR/
Artistic License 2.0

MGnifyR - query samples by depth #17

Open alexschickele opened 12 months ago

alexschickele commented 12 months ago

Dear Ben, as part of the BlueCloud2026 project, I am trying to build a data access function to retrieve species and (probably) KEGG annotations, along with the corresponding reads, from the MGnify database. The first step would be to get the list of marine samples within a certain depth and time range. If I understood correctly, the MGnifyR package does not allow multiple filters in one query. That is why I first tried to query by depth level, which should be the most discriminating factor for filtering samples.

However, I encountered some issues with the mgnify_query() function. Here are some examples:

library(MGnifyR)
library(tidyverse)
mg <- mgnify_client(usecache = TRUE)

First, the metadata_value_gte argument, which I assume means "greater than or equal to":

foo <- mgnify_query(mg, "samples", biome_name = "Marine",
                    metadata_key = "depth",
                    metadata_value_gte = 100,
                    maxhits = 5,
                    usecache = TRUE)
foo$depth %>% unique()
 [1] "1988.0" "75.0"   "102.0"  "1008.0" "119.0"  "182.0"  "101.0"  "30.0"   "111.0"  "5601.0" "202.0"  "143.0"  "150.0"  "151.0"  "100.0"  "135.0"  "149.0" 
[18] "200.0"  "233.0"  "201.0"  "380.0"  "175.0" 

Next, the metadata_value_lte argument, which I assume means "less than or equal to":

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value_lte = 100,
                    maxhits = 5,
                    usecache = TRUE)
foo$depth %>% unique()
 [1] "1988.0" "75.0"   "15.0"   "76.0"   "52.0"   "30.0"   "16.0"   "10.0"   "91.0"   "51.0"   "2.0"    "119.0"  "21.0"   "182.0"  "50.0"   "111.0"  "202.0" 
[18] "33.0"   "49.0"   "74.0"   "14.0"   "68.0"   "40.0"   "151.0"  "69.0" 
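To make the mismatch explicit, here is a quick check on the values printed above. This is only a sketch: the depth strings are copied by hand from the two outputs rather than taken from a live query, and the variable names are my own.

```r
# Depth values copied from the two outputs above (returned as character strings).
gte_depths <- c("1988.0", "75.0", "102.0", "1008.0", "119.0", "182.0", "101.0",
                "30.0", "111.0", "5601.0", "202.0", "143.0", "150.0", "151.0",
                "100.0", "135.0", "149.0", "200.0", "233.0", "201.0", "380.0",
                "175.0")
lte_depths <- c("1988.0", "75.0", "15.0", "76.0", "52.0", "30.0", "16.0", "10.0",
                "91.0", "51.0", "2.0", "119.0", "21.0", "182.0", "50.0", "111.0",
                "202.0", "33.0", "49.0", "74.0", "14.0", "68.0", "40.0", "151.0",
                "69.0")

# Values that violate the requested bound in each case.
bad_gte <- gte_depths[as.numeric(gte_depths) < 100]   # should be empty for _gte = 100
bad_lte <- lte_depths[as.numeric(lte_depths) > 100]   # should be empty for _lte = 100

print(bad_gte)
print(bad_lte)
```

Neither vector is empty, so each query returns at least some samples outside the requested range.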

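Neither _gte nor _lte seems to work properly: the first query returns depths below 100 (e.g. 75.0 and 30.0), and the second returns depths above 100 (e.g. 1988.0 and 202.0). I therefore tried to query a single depth layer at a time, which could then be parallelized to cover all depth levels within our range.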

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value = 100,
                    maxhits = -1,
                    usecache = TRUE)
foo$depth %>% unique()
[1] "100.0"

The query for samples at exactly 100 m depth seems to work. However, with another depth level it no longer does: asking for 5 m also returns other depth values.

foo <- mgnify_query(mg, "samples", biome_name = "Marine", 
                    metadata_key = "depth", 
                    metadata_value = 5,
                    maxhits = -1,
                    usecache = TRUE)
foo$depth %>% unique()
[1] "5.0"  "2.0"  "0.3"  "0.29" "0.33" "0.28"
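As a stopgap, the depth range could be applied client-side after the query returns. Below is a minimal sketch, assuming the result is a data frame whose depth column holds character strings, as in the outputs above. The helper filter_depth_range() is hypothetical (it is not part of MGnifyR), and foo_mock is a made-up stand-in for a query result with placeholder accessions.

```r
# Hypothetical helper (not part of MGnifyR): keep rows whose depth lies
# within [min_depth, max_depth], after converting the strings to numbers.
filter_depth_range <- function(samples, min_depth, max_depth) {
  # Entries that cannot be parsed as numbers become NA and are dropped.
  depth_num <- suppressWarnings(as.numeric(samples$depth))
  keep <- !is.na(depth_num) & depth_num >= min_depth & depth_num <= max_depth
  samples[keep, , drop = FALSE]
}

# Made-up stand-in for a query result; accessions are placeholders.
foo_mock <- data.frame(
  accession = paste0("SAMPLE_", 1:6),
  depth = c("5.0", "2.0", "0.3", "100.0", "1988.0", NA),
  stringsAsFactors = FALSE
)

shallow <- filter_depth_range(foo_mock, min_depth = 0, max_depth = 10)
print(shallow$depth)
```

This does not explain the server-side behaviour, and it requires downloading more samples than needed, but it would at least guarantee that the retained samples fall within the requested range.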

I am wondering whether I am missing something about these functions or misusing them. Thank you in advance for your feedback. Best,

Alexandre