ge11232002 / TFBSTools

Software Package for Transcription Factor Binding Site (TFBS) Analysis
25 stars 10 forks source link

low number of TFs #35

Open rojinsafavi opened 2 years ago

rojinsafavi commented 2 years ago

Hello,

I am using a package called ArchR to do ATACseq analysis. ArchR uses TFBSTools to extract motifs. I noticed there is only 633 TF extracted from Jaspar, but I think the number is much higher (~1200). I was wondering if there is a reason for such a low number? I talked to ArchR developers, and they said they don't do any sort of processing internally, and that I should probably talk to you regarding this matter.

Thank you

> ArchR:::.requirePackage("JASPAR2020",installInfo='BiocManager::install("JASPAR2020")')
> args <- list(species = "Homo sapiens", collection = "CORE")
> motifs <- TFBSTools::getMatrixSet(JASPAR2020::JASPAR2020, args)
> motifs
PFMatrixList of length 633 names(633): MA0030.1 MA0031.1 MA0051.1 MA0057.1 MA0059.1 MA0066.1 MA0069.1 MA0070.1 ... MA0814.2 MA1123.2 MA0093.3 MA0526.3 MA0748.2 MA0528.2 MA0609.2
alexying2110 commented 2 months ago

You've probably solved this by now but for anyone searching for this, this appears this could be at least partially due to an order of operations bug. I haven't had the chance to verify this, but looking at the code, the default behavior when all_versions in opts is either not provided or FALSE is to remove all TFs whose version does not equal the latest version available in the JASPAR database. However, some TFs may have a latest version that doesn't match the "opts" supplied to the getMatrixSet which leads to them being removed. E.g. at the time of writing in JASPAR2024, for SPI1/PU.1 the latest version of the motif is MA0080.7 for Mus musculus, but if species = "Homo sapiens, then SPI1/PU.1 won't appear because none of the Homo sapiens motifs are the latest version.

I'm not sure if this project is still maintained, so I'm not sure if I should just put in a PR or fork the repo. I think the simplest solution is to propagate opts to the latest version filter.

In the meantime this is a rough way to do this in base R.

motifs <- TFBSTools::getMatrixSet(
  RSQLite::dbConnect(RSQLite::SQLite(), "/path/to/JASPAR2024.sqlite"),
  list(
    species = "Homo sapiens",
    collection = "CORE",
    all_versions = TRUE
  )
)
latest_motifs <- Reduce(
  function(a, b) {
    # Split into base_id and version
    id <- strsplit(b, "\\.")[[1]]
    # Make version numeric
    id[2] <- as.numeric(id[2])
    if (! id[1] %in% names(a)) {
      a[id[1]] = id[2]
    } else {
      if (id[2] > a[id[1]]) {
        a[id[1]] = id[2]
      }
    }
    return(a)
  },
  names(motifs),
  init = list()
)
latest_motifs <- paste(names(latest_motifs), latest_motifs, sep = ".")

motifs_filtered <- subset(motifs, names(motifs) %in% latest_motifs)

This gives the identical result to the 755 motifs (up from 720 without all_versions = T in opts) obtained using the following in sqlite3:

SELECT COUNT(DISTINCT M.BASE_ID) FROM MATRIX M
JOIN MATRIX_SPECIES MS ON M.ID = MS.ID
WHERE M.COLLECTION = 'CORE' AND MS.TAX_ID = 9606;