banditelol commented 1 year ago

Docker Networking on Windows and Mac

From the Cube docs: "Using macOS or Windows? Use CUBEJS_DB_HOST=host.docker.internal instead of localhost if your database is on the same machine."
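A minimal sketch of how that advice could be applied in application code. The helper name `resolve_db_host` and the `in_container` flag are hypothetical, made up for illustration; only the `CUBEJS_DB_HOST` variable and the `host.docker.internal` name come from the note above.

```python
import os

# Host names that point back at the container itself when used inside Docker.
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def resolve_db_host(configured_host: str, in_container: bool) -> str:
    """Return the host a containerized app should use to reach the database.

    Hypothetical helper: inside a container, a loopback name refers to the
    container, not the machine running Docker, so it is swapped for
    host.docker.internal. Any other host is returned unchanged.
    """
    if in_container and configured_host in LOOPBACK_NAMES:
        return "host.docker.internal"
    return configured_host


# Example: read the configured value, falling back to localhost.
configured = os.environ.get("CUBEJS_DB_HOST", "localhost")
effective_host = resolve_db_host(configured, in_container=True)
```

Passing `in_container` explicitly keeps the sketch free of any detection heuristic; how to detect a container reliably is left open here.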

banditelol commented 1 year ago

Rebrewing Coffee Grounds

From this article: the idea seems ridiculous, but for someone like me with a (strict) coffee budget it is both interesting and disappointing, since the conclusion is basically that it isn't worth it in most cases.

But on that blog I found other interesting posts about the potential of coffee grounds; here's the gist.

banditelol commented 1 year ago

Garrit's Todo List

A blog I found has a section called todo. At first I thought it was his own to-do list, made public so he could remember it. But it turned out to be a more interesting list.

I need to do this! @banditelol please do this!

Also, I'm going to create an issue for an awesome list later; maybe I'll edit #3 to make it more general purpose. Like:

Blog list
Friend list?
Competition list
Book list

banditelol commented 1 year ago

Sealnet Stuffs

Papers on other animals

Ground truth

So, since Sealnet aims to reduce interference with the seals, is there any prior CMR data for seals matched with their photographs? Or what kind of ground truth exists out there, and how can scientists know for sure that Adrian the seal is actually not his close brother Bryan?

Or is there a more general characteristic of a seal that you can identify? Obvious examples would be male vs female, or some kind of long-lasting mark.

banditelol commented 1 year ago

Interesting Place to Apply

banditelol commented 1 year ago

People That I Follow

Andy Matuschak
Vicky Boykis
Julia Evans
Maggie Appleton
Nick Janetakis
Lilian Weng
Martin Fowler
Paul Graham
Simon Willison
Ian Ozsvald
James Powell
Vincent D. Warmerdam
Jay Alammar
beepb00p.xyz
Ralph Ammer - Talks about drawing, thinking, and generating ideas

I need to add links to the blogs and the reason I follow each one, as a reminder of what content I consume from them. Also need to add this to #3 after renaming it.

Interesting DS Blog

engineeringfordatascience.com
theseattledataguy.com
confessionsofadataguy.com
waitbutwhy.com
dragan.rocks
simplystatistics.org
robjhyndman.com/hyndsight
johndcook.com/blog
statquest.org
medium.com/kaggle-blog
pymc.io/projects/examples/en/latest/blog/category/beginner.html
maxwellrules.com
montecarlodata.com/blog
thinkful.com/blog/tag/data-science
research.facebook.com/research-areas/data-science
flowingdata.com
storytellingwithdata.com/blog
https://sayak.dev/

Podcasts

Quantitude
Radiolab
Freakonomics Radio

Interesting General Sites

Marginalian

banditelol commented 1 year ago

Gitlab's Value

I remembered GitLab's handbook of values being really great, but I didn't remember it being this good. Especially the one about a low level of shame, which admittedly I really struggle with, since I strive to show only the best part of my process.

banditelol commented 1 year ago

Contributing

There are several repos that I'm interested in contributing to:

obsidian-dictionary

banditelol commented 1 year ago

https://ooh.directory

banditelol commented 1 year ago

Designing Data Platform

I guess learning how people design data platforms may be worth the time, for example by looking at the documentation of Materialize.

banditelol commented 1 year ago

Some Exceptions Should not be caught

Derive exceptions from Exception rather than BaseException. Direct inheritance from BaseException is reserved for exceptions where catching them is almost always the wrong thing to do.

https://peps.python.org/pep-0008/#programming-recommendations
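A small sketch of why the recommendation matters. `PipelineError` and `run_step` are hypothetical names made up for illustration; the point is that a broad `except Exception` handler catches application errors but lets `KeyboardInterrupt` (which derives directly from `BaseException`) pass through.

```python
class PipelineError(Exception):
    """Hypothetical application error, derived from Exception as PEP 8 recommends."""


def run_step(step):
    """Run a callable and report ordinary failures instead of crashing."""
    try:
        step()
    except Exception as exc:  # does NOT catch KeyboardInterrupt or SystemExit
        return f"handled: {type(exc).__name__}"
    return "ok"


def failing_step():
    raise PipelineError("bad input")


def interrupted_step():
    # Simulates the user pressing Ctrl+C in the middle of a step.
    raise KeyboardInterrupt


result = run_step(failing_step)

try:
    run_step(interrupted_step)
    escaped = False
except KeyboardInterrupt:
    # The interrupt went straight through the `except Exception` handler.
    escaped = True
```

Had `PipelineError` derived from `BaseException` instead, the handler in `run_step` would have missed it as well.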

banditelol commented 1 year ago

Sqlite based feature store for iteration

banditelol commented 1 year ago

Several Links for Scraping

Selenium Getting Started

Modal for web scraping

PageObject concept by Martin Fowler
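A rough sketch of the Page Object idea, assuming a made-up login page. `FakeDriver` is a hypothetical stand-in for a real browser driver (it is not Selenium's API) so the example runs on its own; the selectors and method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDriver:
    """Hypothetical stand-in for a browser driver; it only records actions."""
    actions: list = field(default_factory=list)

    def type_into(self, selector: str, text: str) -> None:
        self.actions.append(("type", selector, text))

    def click(self, selector: str) -> None:
        self.actions.append(("click", selector))


class LoginPage:
    """Page object: the only place that knows the login page's selectors."""
    USERNAME = "#username"
    PASSWORD = "#password"
    SUBMIT = "button[type=submit]"

    def __init__(self, driver: FakeDriver) -> None:
        self.driver = driver

    def log_in(self, username: str, password: str) -> None:
        # Callers express intent ("log in"); the page object handles the clicks.
        self.driver.type_into(self.USERNAME, username)
        self.driver.type_into(self.PASSWORD, password)
        self.driver.click(self.SUBMIT)


driver = FakeDriver()
LoginPage(driver).log_in("alice", "s3cret")
```

If the page's markup changes, only the selectors inside `LoginPage` need to change, not every scraper or test that logs in.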

banditelol commented 1 year ago

Entity Centric Modeling

While comparing Kubeflow vs MLflow, there's this bonus section discussing the semantic layer.

Entity-centric modeling has been around for a while (star schema with entity tables). But with current OLAP engines we can start from a denormalized table, so we can do more GROUP BYs with fewer joins. We can then use it to develop one semantic layer, either thick (Looker) or thin (Superset). Can we do the same in ML engineering as in BI?

Semantic Layer

Some people aren't happy with a semantic layer for ML: it's not clear what the downstream effects of changes in the current data are. So what the hell are the semantic layer and data mesh? Data mesh is still a buzzword, and it needs a semantic layer. The semantic layer in Looker is just a JSON config file describing our data model, used to generate SQL that queries the underlying database. So we don't need to build tables all the way to the final layer; we can let the tool manage caching and intermediate tables. So the semantic layer is up to the developer, and it is there to generate a language.
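To make the "config file that generates SQL" idea concrete, here is a toy sketch. The JSON layout, the `orders` table, and the `build_query` helper are all assumptions made up for illustration; this is not Looker's actual format or any tool's API.

```python
import json

# Hypothetical model definition: named dimensions and measures over one table.
MODEL_JSON = """
{
  "table": "orders",
  "dimensions": {"country": "country", "status": "status"},
  "measures": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"}
}
"""


def build_query(model: dict, dimensions: list, measures: list) -> str:
    """Turn a request for named dimensions and measures into a SQL string."""
    select_parts = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select_parts += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select_parts)} FROM {model['table']}"
    if dimensions:
        # Group by the positions of the selected dimension columns.
        positions = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {positions}"
    return sql


model = json.loads(MODEL_JSON)
query = build_query(model, ["country"], ["revenue"])
# query == "SELECT country AS country, SUM(amount) AS revenue FROM orders GROUP BY 1"
```

The definition of `revenue` lives in one versioned file, so every consumer that asks for it gets the same SQL.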

Batch use case for a semantic layer on top of a feature store: what would it look like?

We need to define the semantics of the data in the data itself, not at the endpoint in a notebook. This ties to ML when constructing metrics (cleaning, etc.) in dbt. And also data catalogue to

Example

Given some GeoJSON data, a JSON file is used as the logic to generate queries. Why do we need to do this instead of just using SQL queries?

One point is version control.
Logic grouping can be more effective.
Enforcing logic (data tests and unit tests). dbt makes this easy, but it puts more value on the data pipeline. In enrichment:
- Ingest
- Transform (dbt)
- Push Out (Semantic Layer)
But the line is still fuzzy and dynamic.

banditelol commented 1 year ago

Homeserver

Got an interesting Ask HN about homeserver setups; one answer in particular seems interesting (screenshot: Screenshot_2023-01-06-20-23-01-140_org.mozilla.fenix.jpg).

banditelol commented 1 year ago

Engineering Readme and Handbook

Looking at the Ask HN thread on Artsy's engineering handbook as an open-source README. There's an RFC template file that's worth reading. The PR is important in the first issue.

This is the most comprehensive and high quality writing I've seen that includes almost all engineering processes from interviewing to communication. Would love seeing other such engineering handbooks you've seen.

Also adding two other good ones: Sourcegraph - https://handbook.sourcegraph.com/ & Gitlab - https://about.gitlab.com/handbook/#engineering

It's an Employee Handbook, not an Engineering Handbook, but Valve's is very, very well produced[1].

[1] (PDF): https://cdn.akamai.steamstatic.com/apps/valve/Valve_Ne

Also, dblock's comment from the same thread:

(I was CTO of Artsy then.)

tl;dr The README became possible because Artsy is open-source by default, and someone just decided one day to create a repo and some content, and didn't need permission to do so. It's also the repo that most new hires read before they even apply to the job, and they don't need permission to make changes either. GitHub workflow is how everything gets done.

More practically, check out https://github.com/artsy/meta/pull/1, which is one of the repos that merged into the handbook via https://github.com/artsy/README/pull/1. Also note that Artsy was founded in 2010. This handbook in its current iteration is 7-8 years in, but its content goes back to ~2011 in some kind of evolution. You'll want to check out https://artsy.github.io/blog/archives/ as well.

Hokusai

An example of writing an abstraction over several CLIs as a workflow: github.com/artsy/hokusai

banditelol commented 1 year ago

Naming Things

I forgot where I heard this quote, that one of the hardest things in programming is naming things. There are several interesting resources that are worth a read:

banditelol commented 1 year ago

Epistemology

I keep stumbling when trying to remember this word. I kinda want to put it together with pedagogy, andragogy, and mnemonics. Basically it's the study of knowledge and how we know what we know. I first came across it through the tools for thought and memex movement, and it was later reinforced by Andy Matuschak.

banditelol commented 1 year ago

My Near-Ideal Work Platform

Working Remotely

Currently I've been fiddling around with GitHub Codespaces, Gitpod, etc. to enable working on a remote machine with an on-demand model. Previously I was using my own laptop with ZeroTier as my dev environment, but it took quite a big portion of my electricity bill (I need to keep it awake the whole time because the BIOS doesn't support wake and power-off over LAN). For now both Gitpod and Codespaces answer my problem, but they lack a persistent volume for recurring tasks.

I'd like to be able to mount either a folder in my bucket storage or a storage disk from my cloud provider on my remote SSH machine, so that I can work with relatively big data easily. It can be solved by using GCSFS or the like, but it's too much work. I also want to try GitHub Codespaces with a GPU for experimenting online, because Google Colab is too restrictive when it comes to using an IDE and structuring my code properly.

Another option is Coder, which I found just recently. But I still don't know whether it also manages auto-shutdown of an instance; if not, it defeats the purpose of being on demand.

Other options I want to try are Paperspace Gradient, Databricks, and SageMaker. But at a quick glance they seem too centered on notebook-based instances.

Training Remotely

Another use case is training remotely. I want to try the following:

spotty
dvc's TPI

banditelol commented 1 year ago

Aho-corasick string search

It's linear! Found here. I don't know when I would need it; it just seems interesting.
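For reference, a compact sketch of the algorithm in plain Python (a textbook-style version written for this note, not taken from the linked post). It builds a trie of the patterns, adds failure links with a breadth-first pass, then scans the text once, so the search is linear in the text length plus the number of matches reported.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the longest proper suffix.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence of any pattern."""
    goto, fail, out = build_automaton(patterns)
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


matches = find_all("ushers", ["he", "she", "his", "hers"])
# matches == [(1, "she"), (2, "he"), (2, "hers")]
```

A plausible use is scanning a large text for many keywords at once, where running one search per keyword would rescan the text every time.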

banditelol commented 1 year ago

Pattern in Data Engineering

I've been thinking about this topic a lot recently. Found a nice post on Medium about patterns found in DE, but one topic that I really want to explore is how to architect the DE process itself and the code (in this case Python). Are design patterns still highly applicable in this case? Or are there other classes of patterns worth looking into?

May need to talk with people who have lots of DE experience under their belt.

banditelol commented 1 year ago

Reproducible Checklist Kaggle Winner Template

banditelol commented 1 year ago

NASA uses a src folder

banditelol commented 1 year ago

More additions to resources:

SQL style guides, it will be useful when working with multiple people on analysis
The Turing Way of reproducibility and collaboration
Python Design Pattern
Refactoring Guru Design Pattern

banditelol commented 1 year ago

https://gist.github.com/joshbuchea/6f47e86d2510bce28f8e7f42ae84c716
https://cbea.ms/git-commit/

I really need to make up my mind and build the habit of writing better commit messages. Try to implement Simon Willison's "The Perfect Commit".

banditelol commented 1 year ago

Link dump for mas Ridho:

https://discourse.getdbt.com/t/how-we-set-up-our-computers-for-working-on-dbt-projects/243/3
Data diff between table : https://github.com/datafold/data-diff
DBT CI example : https://github.com/datafold/demo/tree/master/.github/workflows
SQLAlchemy Clickhouse : https://github.com/xzkostyan/clickhouse-sqlalchemy
Airflow + DBT on Model Level : https://docs.astronomer.io/learn/airflow-dbt#use-case-2-dbt-core-and-airflow-at-the-model-level
Testing code vs data : https://www.youtube.com/watch?v=hxvVhmhWRJA&list=TLPQMTQwMjIwMjN_L4bWT1_OxQ&index=4

banditelol commented 1 year ago

Create my own ideal cookiecutter

https://github.com/laserkelvin/cookiecutter-pytorch-project

makefile
pre-commit
poetry / (piptools + mamba)
DVC
.vscode
optional devcontainer

banditelol commented 1 year ago

Some templates for reproducibility

CUPID in DE
GH Actions deps using NIX
Clean Arch in DS
DI for functional programming
Cookiecutter for Prefect DS
- DS project Structure
Cookiecutter kaggle dvc
CI/CD ML Model
ML Serving MLFlow
DVC for experiment
Modern Data Stack Stuffs
Missing piece of Modern Data Stack
WTH is Metric Layer

banditelol commented 1 year ago

Toolbox

I want to create a website, I mean a book, that is like a river station of the toolbox that I usually use today, and a little box where I could clean the skills or the tools or the framework or the mindset that I have, depending on how I could utilize it in a project, in a blog post, or in a daily limit of what I learned.

blog idea

Another idea for an article in my book: an explanation of the difference between a MultiIndex DataFrame and a Grouper.

Another is the representation of relatedness using a cube in 3D: how points that look close in 2D aren't necessarily close in 3D.

banditelol commented 1 year ago

Gitlab Handbook is a treasure trove of quality content

banditelol commented 1 year ago

When should we try to purposefully overfit on a batch? https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#overfit-batches
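One common answer is "as a sanity check": if a model cannot drive the loss close to zero on a single tiny batch, the bug is probably in the model or training loop rather than in the data volume. The sketch below shows the idea in plain Python rather than Lightning, with a made-up linear model and a made-up four-point batch; every name in it is an assumption for illustration.

```python
def overfit_single_batch(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b to one fixed batch with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Mean squared error on the same batch at every step.
        losses.append(sum(e * e for e in errors) / n)
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


# Hypothetical batch generated from y = 2x + 1, so a perfect fit exists.
batch_x = [0.0, 1.0, 2.0, 3.0]
batch_y = [1.0, 3.0, 5.0, 7.0]
w, b, losses = overfit_single_batch(batch_x, batch_y)

# Sanity check: the loss on the memorized batch should be essentially zero.
training_loop_looks_sane = losses[-1] < 1e-6
```

If the final loss stays high in a check like this, something in the forward pass, loss, or update step deserves a closer look before training on the full dataset.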

banditelol commented 1 year ago

I'm thinking about using oauth2-proxy for simple Streamlit apps so that we can deploy internal apps gracefully.

banditelol commented 1 year ago

Gotta create a comprehensive checklist based on pythonspeed's advice. Also a simple Dockerfile for generic projects: https://pythonspeed.com/articles/

banditelol commented 1 year ago

Later if I want to test SQL, this can be a good guide

banditelol commented 1 year ago

banditelol / public-notes

Scratchpad #4
