KnowledgeCaptureAndDiscovery / somef

SOftware Metadata Extraction Framework: A tool for automatically extracting relevant software information from readme files
MIT License
44 stars, 22 forks

Add a sample test repository #595

Open dgarijo opened 11 months ago

dgarijo commented 11 months ago

This repository: https://github.com/tpronk/somef-demo-repo should be added to the documentation.

tpronk commented 11 months ago

I have to confess it's not really that good yet. However, my attention during the ELIXIR BioHackathon has already been requested for other tasks... hope to have an update soon!

dgarijo commented 11 months ago

no rush!

tpronk commented 11 months ago

Back in business! Below, I'll add things I noticed while creating the demo repo. As I won't be done today, more might follow...

dgarijo commented 11 months ago

Thanks, let me open this in a new issue. Many people have been editing the docs, and keeping everything consistent can be challenging.

tpronk commented 11 months ago

Having too many contributors sounds like a lovely problem to have :). Below, I've got two more potential issues... I'll post them in separate comments.

tpronk commented 11 months ago

I think there might be an issue with extracting a logo when there is no slash (/) in the path to the logo. For illustration, below is a snippet of the README.md of the somef-demo-repo, followed by a snippet of the JSON output of SOMEF. Note that logo1.png is not recognized as a logo, but logo_directory/logo2.png is. Same result if I use logo.png and if I don't have the logo_directory/logo2.png in the README.md

README.md

# Image
Images used to illustrate the software component.
![logo1.png](logo1.png)

# Logo
Main logo used to represent the target software component.
![logo2.png](logo_directory/logo2.png)

SOMEF Output

"logo": [
  {
    "result": {
      "type": "Url",
      "value": "https://raw.githubusercontent.com/tpronk/somef-demo-repo/main/logo_directory/logo2.png"
    },
    "confidence": 1,
    "technique": "regular_expression",
    "source": "https://raw.githubusercontent.com/tpronk/somef-demo-repo/main/README.md"
  }
],
"image": [
  {
    "result": {
      "type": "Url",
      "value": "https://raw.githubusercontent.com/tpronk/somef-demo-repo/main/logo1.png"
    },
    "confidence": 1,
    "technique": "regular_expression",
    "source": "https://raw.githubusercontent.com/tpronk/somef-demo-repo/main/README.md"
  }
]
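To illustrate the behaviour this comment expects, here is a minimal sketch that treats a Markdown image as a logo whether or not its path contains a slash. It is not SOMEF's actual extraction code: `IMAGE_PATTERN`, `classify_images`, the rule "the file name contains `logo`", and the third image `figures/plot.png` are all assumptions made for illustration.

```python
import re

# Matches Markdown images of the form ![alt](target), capturing alt and target.
# Hypothetical pattern for illustration, not the expression SOMEF uses.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def classify_images(readme_text):
    """Split image targets into logos and other images.

    Assumed rule: a target is a logo when its file name (the part after
    the last slash, if there is one) contains "logo".
    """
    logos, images = [], []
    for _alt, target in IMAGE_PATTERN.findall(readme_text):
        # rsplit returns the whole target when it contains no slash,
        # so "logo1.png" and "logo_directory/logo2.png" are handled alike.
        file_name = target.rsplit("/", 1)[-1]
        if "logo" in file_name.lower():
            logos.append(target)
        else:
            images.append(target)
    return logos, images


readme = (
    "![logo1.png](logo1.png)\n"
    "![logo2.png](logo_directory/logo2.png)\n"
    "![plot](figures/plot.png)\n"  # made-up non-logo image for contrast
)
logos, images = classify_images(readme)
print(logos)
print(images)
```

Taking the file name with `rsplit("/", 1)[-1]` avoids depending on a directory separator being present, which is the case the report above suspects is mishandled.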
tpronk commented 11 months ago

At the Hackathon, we've been extracting metadata from around 65 repos, but in none of the SOMEF output can I find the field has_executable_notebook. Also, in the SOMEF source code, I couldn't easily identify any snippets that extract it. Does this field still work? If so, might you have an example for me of a repo where it can be extracted from?
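A check like the one described here can be scripted. The sketch below assumes each repository's SOMEF output was saved as one JSON file whose top level is an object keyed by field name; the function name `repos_with_field` and the two demo files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def repos_with_field(output_dir, field):
    """Return the names of JSON files in output_dir with a non-empty `field`.

    Assumes the top level of each file is an object keyed by field name.
    """
    hits = []
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        if data.get(field):
            hits.append(path.name)
    return hits


# Demo with two made-up output files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "repo_a.json").write_text(
        json.dumps({"has_executable_notebook": [{"result": {}}]}), encoding="utf-8"
    )
    Path(tmp, "repo_b.json").write_text(json.dumps({"logo": []}), encoding="utf-8")
    found = repos_with_field(tmp, "has_executable_notebook")

print(found)
```

An empty result across all output files would match the observation that the field never appears.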

tpronk commented 11 months ago

I found a case where values extracted for the invocation field were attributed to README.md, but on visual inspection, I found them in README.Rmd instead. It concerns this repo. Below is a snippet of the SOMEF output. Credits to Esteban for providing this dataset :)

    "invocation": [
        {
            "result": {
                "type": "Text_excerpt",
                "value": "\n```{r, echo=FALSE, results='asis', message = FALSE}\nmy_apc %>% select(institution, euro) %>% \n  group_by(institution) %>% \n  ezsummary::ezsummary(n = TRUE, digits= 0, median = TRUE,\n                       extra = c(\n                         sum = \"sum(., na.rm = TRUE)\",\n                         min = \"min(., na.rm = TRUE)\",\n                         max = \"max(., na.rm = TRUE)\"\n                         )) %>%\n  mutate_all(format, big.mark=',') %>%\n  ezsummary::ezmarkup('...[. (.)]..[. - .]') %>%\n#> get rid of blanks\n  mutate(`mean (sd)` = gsub(\"\\\\(  \", \"(\", .$`mean (sd)`)) %>% \n  select(institution, n, sum, `mean (sd)`, median, `min - max`) %>%\n  arrange(desc(n)) %>%\n  knitr::kable(col.names = c(\"Institution\", \"Articles\", \"Spending total (in \u20ac)\", \"Mean (SD)\", \"Median\", \"Minimum - Maximum\"), align = c(\"l\",\"r\", \"r\", \"r\", \"r\", \"r\"))\n``` \n",
                "original_header": "Fully Open Access Journals"
            },
            "confidence": 0.906763643352601,
            "technique": "supervised_classification",
            "source": "https://raw.githubusercontent.com/MPDL/unibiAPC/master/README.md"
        },
        {
            "result": {
                "type": "Text_excerpt",
                "value": "```{r, echo = FALSE, warning = TRUE}\n\nknitr::opts_knit$set(base.url = \"/\")\nknitr::opts_chunk$set(\n  comment = \"#>\",\n  collapse = TRUE,\n  warning = FALSE,\n  message = FALSE,\n  echo = FALSE,\n  fig.width = 9,\n  fig.height = 6\n)\noptions(scipen = 999, digits = 0, tibble.width = Inf, tibble.print_max = Inf)\n\nknitr::knit_hooks$set(inline = function(x) {\n  prettyNum(x, big.mark = \",\")\n})\n```\n```{r}\nrequire(dplyr)\nrequire(ggplot2)\nrequire(ezsummary)\nrequire(pander)\n```\n```{r, echo=FALSE, cache = FALSE}\nmy_apc <- readr::read_csv(\"data/apc_de.csv\")\n```\n \n"
            },
            "confidence": 0.9211067534061969,
            "technique": "supervised_classification",
            "source": "https://raw.githubusercontent.com/MPDL/unibiAPC/master/README.md"
        }
    ]
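One way to double-check the `source` attribution is to look up which file actually contains the extracted excerpt. This is a minimal sketch, not part of SOMEF: `files_containing` is a hypothetical helper, and the two file bodies are made-up stand-ins rather than the real contents of the repository.

```python
def files_containing(excerpt, files):
    """Return the names of files whose text contains the excerpt.

    Whitespace is normalised on both sides so that differences in line
    breaks or indentation do not hide a match.
    """
    needle = " ".join(excerpt.split())
    if not needle:
        return []
    return [
        name
        for name, text in files.items()
        if needle in " ".join(text.split())
    ]


# Made-up stand-ins for the two candidate source files.
files = {
    "README.md": "# Demo\nSummary tables rendered from the R Markdown source.\n",
    "README.Rmd": "# Demo\n```{r}\nrequire(dplyr)\nrequire(ggplot2)\n```\n",
}
matches = files_containing("```{r}\nrequire(dplyr)\nrequire(ggplot2)\n```", files)
print(matches)
```

If the excerpt is found only in `README.Rmd` while the output names `README.md` as the source, the attribution is off, as described above.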
dgarijo commented 11 months ago

Thanks for these issues. executable_notebook should return the MyBinder links. I see that these are now added in executable_example. This may need a review (the schema underwent a few changes). All other issues are legit. Thanks a lot! We'll need to address them.

dgarijo commented 11 months ago

If you find any more, please open them! I usually open them as I test in diverse repos, but sometimes it is tricky to get to these edge cases.

tpronk commented 11 months ago

Bueno & gracias. I'll keep 'em coming then :)

tpronk commented 11 months ago

Wrapping things up, I compared fields mentioned in the README.md of SOMEF to the fields in constants.py. These are the discrepancies I found in terms of entries I couldn't find in one or the other, ignoring cases where they probably just have a different name
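A comparison of this kind comes down to two set differences. The sketch below is illustrative only: the two sets reuse field names mentioned in this thread and are not the discrepancies actually found between the README.md and constants.py.

```python
# Illustrative field sets, NOT the real contents of README.md or constants.py.
documented = {"logo", "image", "invocation", "has_executable_notebook"}
implemented = {"logo", "image", "invocation", "executable_example"}

# Fields that appear on one side only; sorted for stable, readable output.
only_documented = sorted(documented - implemented)
only_implemented = sorted(implemented - documented)

print("Documented but not in constants:", only_documented)
print("In constants but not documented:", only_implemented)
```

Entries that merely have a different name on each side show up in both lists, so they still need the manual pass mentioned above.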

tpronk commented 11 months ago

All right then. SOMEF 0.9.4 can extract a total of 48 fields from this version of somef-demo-repo, which could make it a nice integration test, I guess.

dgarijo commented 11 months ago

Definitely. Thanks!!