NCEAS / learning-hub-organization

This repo uses GitHub projects to manage Learning Hub tasks that Learning Hub Team Leads work on

Help with `deltafish` R package #114

Closed by camilavargasp 17 hours ago

camilavargasp commented 4 weeks ago

Sam Bashevkin approached Delta Stewardship Council with the following request:

Hi Maggie,

If you remember, as part of our NCEAS work group, Jeanette from NCEAS developed the deltafish R package (https://github.com/Delta-Stewardship-Council/deltafish) that provides access to the large integrated database of fish monitoring data. I’ve been working with collaborators at CDFW to add more data to the package, particularly the Salvage dataset, which has unfortunately made the dataset so large that the package is no longer working. Jeanette would be the best person to fix it, so I was wondering if there is any chance you have funding in your NCEAS contract for some follow-up on the past workshop so that Jeanette would be able to take a look at the package and get it working again. I’d be happy to chat about this if it would be easier.

I hope all is well at DSP!

Best,

Sam

camilavargasp commented 4 weeks ago

Next Steps

angelchen7 commented 1 week ago

Met with Sam

I met with Sam today, and he showed me where deltafish stopped working. He suspects that the `left_join` call that joins two arrow datasets is what crashes his R session. He left-joined a 112-row dataset to a 60-million-row dataset. After running `collect`, the result should be a simple 112-row dataset, yet his R session crashes with the message, "terminate called after throwing an instance of..."

This may indicate a larger issue with arrow itself, and may not have anything to do with the updated deltafish data (which grew from 40 million to 60 million rows).
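For reference, here is a minimal sketch of the pattern described above: a lazy `left_join` between two arrow objects, evaluated only when `collect` is called. The table and column names (`small`, `big`, `SampleID`, `Station`, `Count`) are hypothetical stand-ins, not the actual deltafish tables, and the tiny in-memory tables here will not reproduce the crash; they only show the shape of the call.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the small (112-row) table.
small <- arrow_table(
  SampleID = c("a", "b", "c"),
  Station  = c("S1", "S2", "S3")
)

# Hypothetical stand-in for the large (60-million-row) dataset.
big <- arrow_table(
  SampleID = c("a", "a", "b", "d"),
  Count    = c(1L, 2L, 3L, 4L)
)

# The join is built lazily; arrow only executes it when collect() is called.
result <- small %>%
  left_join(big, by = "SampleID") %>%
  collect()
```

In Sam's case it is the `collect` step at the end of a pipeline like this that ends the R session.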

Next Steps

I feel this issue is beyond my abilities, so I need to consult with Jeanette on some options for moving forward (opening a GitHub issue on the arrow repo? converting to duckdb?...). I told Sam that he could open a Slack group chat with Jeanette and me, just so Jeanette knows what's going on, but I will be the one carrying out the debugging.

angelchen7 commented 1 week ago

Chatted with Jeanette

Yesterday, Sam opened a Slack group chat with me and Jeanette, and he explained his problem there. Jeanette suggested a workaround where the `collect` call is run first, and then the join. That seemed to work fine for him, so I think he'll stick to collecting before joining in his script for now. I think he was a bit disappointed that arrow couldn't deal with large joins of uncollected data, but there's not much we can do in that area.

I'll keep this issue open for a bit before closing in case Sam has additional follow-up questions on Slack.
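A minimal sketch of the collect-before-join workaround, again using hypothetical table and column names (`small_df`, `big`, `SampleID`, `Station`, `Count`) rather than the real deltafish tables. The `filter` step before `collect` is an optional addition that was not part of the suggestion in the thread; it assumes the join keys of the small table are known up front and only serves to limit how many rows are pulled into memory.

```r
library(arrow)
library(dplyr)

# Hypothetical small table, already in memory as a tibble.
small_df <- tibble(
  SampleID = c("a", "b", "c"),
  Station  = c("S1", "S2", "S3")
)

# Hypothetical stand-in for the large arrow dataset.
big <- arrow_table(
  SampleID = c("a", "a", "b", "d"),
  Count    = c(1L, 2L, 3L, 4L)
)

# Optional (assumption): keep only rows whose key appears in the small table,
# so that collect() does not materialize the whole dataset.
keys <- unique(small_df$SampleID)

big_df <- big %>%
  filter(SampleID %in% keys) %>%
  collect()

# The join now runs in dplyr on ordinary data frames, not inside arrow.
joined <- left_join(small_df, big_df, by = "SampleID")
```

Collecting first moves the join out of arrow and into dplyr, at the cost of holding the collected rows in memory.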