Open banditelol opened 1 year ago
From this article: while the idea seems ridiculous, for me, someone on a (strict) coffee budget, this is interesting and disappointing, since the conclusion is basically that it isn't worth it in most cases.
But from that blog I found other interesting pieces about the potential of coffee grounds; here's the gist.
A blog I found has a section called "todo". At first I thought it was the author's own to-do list, made public so he could come back to it. But it turned out to be a more interesting list.
I need to do this! @banditelol please do this!
Also, I'm going to create an awesome-list issue later; maybe I'll repurpose #3 for something more general. Like:
So, since SealNet aims to reduce interference with the seals, is there any prior CMR data for seals matched with their photographs? Or what kind of ground truth exists out there, and how can scientists know for sure that Adrian the seal isn't actually his close brother Bryan?
Or are there more general characteristics of a seal that you identify? An obvious example is male vs. female, or is there some kind of long-lasting mark?
Need to add links to the blogs and the reason I follow each one, as a reminder of what content I consume from each. Also need to add them to #3 after renaming it.
I remembered GitLab's values handbook being really great, but I didn't remember it being this good. Especially the part on low level of shame, which admittedly I really struggle with, along with striving to only show the best part of my process.
I guess learning how people design data platforms may be worth the time, for example by looking at Materialize's documentation.
Derive exceptions from Exception rather than BaseException. Direct inheritance from BaseException is reserved for exceptions where catching them is almost always the wrong thing to do.
https://peps.python.org/pep-0008/#programming-recommendations
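A minimal sketch of what that PEP 8 recommendation looks like in practice; the class names below are made up for illustration:

```python
# Per PEP 8: derive your exceptions from Exception, not BaseException,
# so that a plain `except Exception` still catches them.
class FeatureStoreError(Exception):
    """Base class for this (hypothetical) project's errors."""

class MissingFeatureError(FeatureStoreError):
    """Raised when a requested feature is absent."""

try:
    raise MissingFeatureError("no such feature: 'avg_txn_7d'")
except Exception as exc:          # catches it, because it derives from Exception
    caught = type(exc).__name__   # "MissingFeatureError"
```

If these classes derived directly from `BaseException` instead, `except Exception` would let them sail through, which is almost never what callers expect.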
SQLite-based feature store for iteration
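A rough sketch of what a minimal SQLite-backed store could look like for quick iteration; the schema and helper names are my own assumptions, not any real feature-store library's API:

```python
import sqlite3

# Toy feature store: (entity, feature name, value, timestamp) rows,
# with "latest value wins" reads. In-memory DB for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE features (
        entity_id TEXT,
        name      TEXT,
        value     REAL,
        ts        TEXT,
        PRIMARY KEY (entity_id, name, ts)
    )
""")

def write_feature(entity_id, name, value, ts):
    conn.execute("INSERT OR REPLACE INTO features VALUES (?, ?, ?, ?)",
                 (entity_id, name, value, ts))

def read_latest(entity_id, name):
    row = conn.execute(
        "SELECT value FROM features WHERE entity_id = ? AND name = ? "
        "ORDER BY ts DESC LIMIT 1", (entity_id, name)).fetchone()
    return row[0] if row else None

write_feature("user_1", "avg_txn_7d", 12.5, "2023-01-01")
write_feature("user_1", "avg_txn_7d", 14.0, "2023-01-02")
latest = read_latest("user_1", "avg_txn_7d")  # 14.0
```

The appeal for iteration is that the whole store is one file (or in-memory), so experiments stay reproducible without standing up infrastructure.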
While comparing Kubeflow vs. MLflow, there's this bonus section discussing the semantic layer.
The idea has been around for a while (star schema with an entity table). But with current OLAP engines we can start from a denormalized table, so we can do more GROUP BYs with fewer JOINs. From there we can develop a semantic layer, either thick (Looker) or thin (Superset). Can we do the same in ML engineering as in BI?
Some people aren't happy with a semantic layer for ML: it's not clear what the downstream effect of upstream data changes is. So what exactly are the semantic layer and the data mesh? Data mesh is still a buzzword, and it needs a semantic layer. The semantic layer in Looker is just a JSON-style config for our data model that generates SQL to query the underlying database. So we don't need to build it until the final layer, and we can let caching and intermediate tables be managed by the tool. In short, the semantic layer is up to the developer, and it exists to generate a query language.
Batch use case for a semantic layer on top of a feature store: what would that look like?
We need to define the semantics of the data in the data itself, not at the endpoint of a notebook. This ties into ML when constructing metrics (cleaning, etc.) in dbt. Also relevant to the data catalogue.
Given certain GeoJSON data, a JSON file is used as the logic to generate a query. Why do we need this instead of just writing a SQL query?
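As a toy illustration of the idea (the config shape below is invented, not Looker's or Cube's actual format), a JSON model definition can be compiled into SQL so the query logic lives in the model, not in each notebook:

```python
import json

# A made-up semantic-layer model: one table, its dimensions, and named
# measures. The "compiler" below turns a measure request into SQL.
model = json.loads("""
{
  "table": "orders",
  "dimensions": ["country"],
  "measures": {"revenue": "SUM(amount)"}
}
""")

def to_sql(model, measure):
    dims = ", ".join(model["dimensions"])
    expr = model["measures"][measure]
    return (f"SELECT {dims}, {expr} AS {measure} "
            f"FROM {model['table']} GROUP BY {dims}")

sql = to_sql(model, "revenue")
# -> SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country
```

The answer to "why not just SQL" in this framing: every consumer asking for `revenue` gets the same definition, and changing the measure in one config changes every generated query.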
Got an interesting Ask HN about home-server setups; one answer seems particularly interesting.
Looking at the Ask HN thread on Artsy's engineering handbook as an open-source README. There's an RFC template file that's worth reading. The PR on the first issue is also worth a look.
This is the most comprehensive and high quality writing I've seen that includes almost all engineering processes from interviewing to communication. Would love seeing other such engineering handbooks you've seen.
Also adding two other good ones: Sourcegraph - https://handbook.sourcegraph.com/ & Gitlab - https://about.gitlab.com/handbook/#engineering
It's an Employee Handbook, not an Engineering Handbook, but Valve's is very, very well produced[1].
[1] (PDF): https://cdn.akamai.steamstatic.com/apps/valve/Valve_Ne
dblock (HN comment):
(I was CTO of Artsy then.)
tl;dr The README became possible because Artsy is open-source by default, and someone just decided one day to create a repo and some content, and didn't need permission to do so. It's also the repo that most new hires read before they even apply to the job, and they don't need permission to make changes either. GitHub workflow is how everything gets done.
More practically, check out https://github.com/artsy/meta/pull/1, which is one of the repos that merged into the handbook via https://github.com/artsy/README/pull/1. Also note that Artsy was founded in 2010. This handbook in its current iteration is 7-8 years in, but its content goes back to ~2011 in some kind of evolution. You'll want to check out https://artsy.github.io/blog/archives/ as well.
Example of writing an abstraction over several CLIs as a workflow: github.com/artsy/hokusai
I forget where I heard this quote, that one of the hardest things in programming is naming things. There are several interesting resources that are worth a read:
I keep stumbling when trying to remember this word: epistemology. I kind of want to put it together with pedagogy, andragogy, and mnemonics. Basically, it's the study of knowledge and how we know what we know. I first stumbled upon it in the tools-for-thought and memex movement, and it was later reinforced by Andy Matuschak.
Currently I've been fiddling around with GitHub Codespaces, Gitpod, etc. to enable working on a remote machine with an on-demand model. Previously I was using my own laptop with ZeroTier as my dev environment, but it took quite a big portion of my electricity bill (since I needed to keep it awake the whole time, because the BIOS doesn't support wake or power-on by LAN). For now both Gitpod and Codespaces answer my problem, but they lack a persistent volume for recurrent tasks.
I'd like to be able to mount either a folder in my bucket storage or a storage disk from my cloud provider into my remote SSH session, so that I could work with relatively big data easily. It can be solved with GCSFS or the like, but that's too much work. I also want to try GitHub Codespaces with a GPU for experimenting online, because Google Colab is too restrictive in letting me use an IDE and structure my code properly.
Another option is Coder, which I just found recently. But I still don't know whether it also manages auto-shutdown of instances; if not, that defeats the purpose of being on-demand.
Other options I want to try are Paperspace Gradient, Databricks, and SageMaker. But at a quick glance they seem too centered on notebook-based instances.
Another use case is training remotely. I want to try the following:
It's linear! Found here. I don't know when I would need it; it just seems interesting.
I've been thinking about this topic a lot recently. I found a nice Medium post about patterns in DE, but one topic I really want to explore is how to architect the DE process itself and the code (in this case, Python). Are design patterns still highly applicable here, or are there other classes of patterns worth looking into?
I may need to talk with people who have lots of DE experience under their belt.
More additions to resources:
https://gist.github.com/joshbuchea/6f47e86d2510bce28f8e7f42ae84c716 https://cbea.ms/git-commit/
I need to really make up my mind and habituate myself to writing better commit messages. Try to implement Simon Willison's "the perfect commit".
Link dump for mas Ridho:
https://github.com/laserkelvin/cookiecutter-pytorch-project
I want to create a website, or rather a book, that acts as a way station for the toolbox I use today: a little box where I collect the skills, tools, frameworks, and mindsets I have, organized by how I could utilize each one in a project, a blog post, or a daily log of what I learned.
Another idea for an article in my book: an explanation of the difference between a MultiIndex DataFrame and a Grouper.
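A quick sketch of that contrast, with made-up sample data: grouping by plain keys yields a MultiIndex result, while `pd.Grouper` adds time-frequency semantics inside `groupby`:

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2023-01-01", "2023-02-01", "2023-01-15", "2023-02-15"]),
    "sales": [10, 20, 30, 40],
})

# Grouping by two keys -> result indexed by a MultiIndex (store, month)
by_keys = df.groupby(["store", df["date"].dt.month])["sales"].sum()

# pd.Grouper lets the same groupby bin the date column by frequency
# ("MS" = month start), i.e. resampling semantics inside a groupby
by_grouper = df.groupby(["store", pd.Grouper(key="date", freq="MS")])["sales"].sum()
```

So a MultiIndex is a property of the *result* (hierarchical axis labels), while `Grouper` is a *grouping instruction*, useful when one of the keys needs frequency-based binning.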
Another is the representation of relatedness using a cube in 3D: points that look close in 2D aren't necessarily close in 3D.
The GitLab Handbook is a treasure trove of quality content.
When should we try to purposefully overfit on a batch? https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#overfit-batches
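Per the linked docs, this is just a `Trainer` flag; a minimal config sketch (assuming `pytorch_lightning` is installed), useful as a sanity check that the model can memorize a tiny fixed subset before training for real:

```python
from pytorch_lightning import Trainer

# Train repeatedly on a small fixed subset instead of the full data:
trainer = Trainer(overfit_batches=1)        # exactly 1 batch
# trainer = Trainer(overfit_batches=0.01)   # or a fraction (1%) of the data
```

If the loss on that single batch won't go to ~0, there's likely a bug in the model or data pipeline, which is the point of overfitting on purpose.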
I'm thinking about putting oauth2-proxy in front of a simple Streamlit app so that we could deploy internal apps gracefully.
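A sketch of what the oauth2-proxy config could look like in front of a Streamlit app; all values below are placeholders (Streamlit's default port is 8501), so treat this as a shape, not a working setup:

```
# oauth2-proxy.cfg — illustrative values only
http_address  = "0.0.0.0:4180"
upstreams     = ["http://127.0.0.1:8501"]   # the Streamlit app
provider      = "google"
email_domains = ["example.com"]             # who is allowed in
client_id     = "<oauth-client-id>"
client_secret = "<oauth-client-secret>"
cookie_secret = "<random-32-byte-secret>"
```

Users then hit the proxy on 4180, authenticate with the identity provider, and only authenticated requests are forwarded to the app.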
Gotta create a comprehensive checklist based on pythonspeed advice, and also a simple Dockerfile for a generic project: https://pythonspeed.com/articles/
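As a starting point, a hedged Dockerfile sketch applying pythonspeed-style advice (slim base image rather than alpine, dependency layer cached separately from code, non-root user); file names are placeholders:

```
# Generic Python project Dockerfile — a sketch, not a vetted checklist
FROM python:3.11-slim

# Copy and install dependencies first so this layer is cached
# across code-only changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Don't run as root
RUN useradd --create-home appuser
USER appuser
WORKDIR /home/appuser

COPY --chown=appuser . .
CMD ["python", "main.py"]
```

The checklist items worth extracting from this: pinned dependencies, layer-cache-friendly ordering, and a non-root runtime user.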
Later, if I want to test SQL, this can be a good guide.
Docker Networking on Windows and Mac
Using macOS or Windows? Use CUBEJS_DB_HOST=host.docker.internal instead of localhost if your database is on the same machine, says Cube.
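For instance, a Cube `.env` fragment might look like this (the DB type and port are assumptions for illustration; only the host line comes from the note above):

```
# .env for Cube running in Docker, DB running on the host machine
CUBEJS_DB_TYPE=postgres
CUBEJS_DB_HOST=host.docker.internal   # not localhost on macOS/Windows
CUBEJS_DB_PORT=5432
```

Inside a container, `localhost` is the container itself; `host.docker.internal` is Docker Desktop's alias for the host, which is why the swap is needed on macOS and Windows.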