Open leshier12 opened 6 years ago
Hello leshier12, I was interested in the exact same thing. Have you found any resources for this? Does it look like this would be a possibility with this API?
What database do you use? For PostgreSQL, look at this post: https://stackoverflow.com/questions/23557537/how-to-convert-numpy-array-to-postgresql-list
I believe the approach I would like to take is storing the 128 measurements (the embedded face of each known face) in some type of database, then querying this database with a basic machine-learning classification algorithm like an SVM classifier (or kNN?) using an unknown face grabbed from an image. Any notes on how this type of database could be structured? In facerec_from_video_file.py they build an array of known faces and then call compare_faces(known_faces, face_encoding, tolerance). I'd like a system that can scale to a very large number of known faces (possibly 1 image per known face). In the end, I hope to feed my system a video stream. Thanks for any advice / insight into the performance of compare_faces()!
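For context on compare_faces() performance: the library implements it as a linear scan, computing the Euclidean distance between the unknown encoding and every known encoding and flagging those within the tolerance (0.6 by default). A minimal pure-Python sketch of that idea (the real implementation is a vectorized numpy computation; the toy 3-dimensional encodings here stand in for real 128-dimensional ones):

```python
import math

def face_distance(known_encodings, unknown):
    # Euclidean distance from the unknown encoding to each known encoding
    return [math.sqrt(sum((k - u) ** 2 for k, u in zip(enc, unknown)))
            for enc in known_encodings]

def compare_faces(known_encodings, unknown, tolerance=0.6):
    # True for each known face whose distance is within the tolerance
    return [d <= tolerance for d in face_distance(known_encodings, unknown)]

known = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
unknown = [0.1, 0.25, 0.3]
print(compare_faces(known, unknown))  # first is a match, second is not
```

Because every call touches every known encoding, the cost grows linearly with the number of known faces, which is what motivates pushing the search into a database.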
Having a table of 128 float columns (f1, f2, ..., f128), you can build a composite index on several of them (e.g. (f1, f2, f3, f4)) for selection optimization, and then query:
SELECT id, POW(f1 - :e1, 2) + POW(f2 - :e2, 2) + ... + POW(f128 - :e128, 2) AS square_distance
FROM encodings
WHERE
f1 > :minF1 AND f1 < :maxF1 AND
f2 > :minF2 AND f2 < :maxF2 AND
...
f128 > :minF128 AND f128 < :maxF128
ORDER BY square_distance ASC LIMIT 1
where
:eX = encodingX
:minFX = encodingX - 0.1 * abs(encodingX)
:maxFX = encodingX + 0.1 * abs(encodingX)
The 0.1 defines how strict the selection is; 0 is most strict.
This should bring you the row with the minimal vector distance to the searched encoding. It may also return nothing if the selection is too strict.
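The :eX / :minFX / :maxFX parameters can be generated in application code. A sketch with a hypothetical query_params() helper and a toy 4-component encoding (a real one has 128):

```python
def query_params(encoding, strictness=0.1):
    # Build the :eX, :minFX, :maxFX values for the WHERE clause above
    params = {}
    for i, e in enumerate(encoding, start=1):
        params[f"e{i}"] = e
        params[f"minF{i}"] = e - strictness * abs(e)
        params[f"maxF{i}"] = e + strictness * abs(e)
    return params

p = query_params([-0.096, 0.142, 0.051, -0.033])
print(p["minF1"], p["maxF1"])  # bounds bracketing -0.096
```

One caveat: components near zero get an almost empty [min, max] range, which is another way the selection can become too strict and return nothing.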
Fantastic vearutop. That query is exactly what I'm looking for! I needed that confirmation before moving forward. Thanks so much!
PostgreSQL has the type CUBE; use it, that will be much easier. Ex: SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1; https://www.postgresql.org/docs/10/static/cube.html Remember: "To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger." Change cubedata.h to 128 and test.
Thanks Railton! This works perfectly. I'm totally new to PostgreSQL. I've used "CREATE EXTENSION cube" to import the data type (works up to 100 dimensions). I can't find "cubedata.h" anywhere in the postgresql binary to change the value from 100 dimensions to 128. Does anyone know where to find this? Should I be importing the cube data type another way? Thanks in advance.
I can only find cube.sql files in share/postgresql/extensions
Use this container, it is already changed to work up to 350 dimensions. https://github.com/oelmekki/postgres-350d
Thanks Again :)
Adding this for anybody trying to make this type of DB on macOS, and for my own future reference when I forget how to do this and need to reinstall. (The docker solution did not work for me, so this is the manual solution that did):
Requirements:
For starters, make sure you have postgresql installed so that 'pg_config' is available: $ brew install postgresql
INSTRUCTIONS: download the source for postgresql: https://ftp.postgresql.org/pub/source/v9.6.0/postgresql-9.6.0.tar.bz2 (get the correct version number; mine is 10.3)
unzip...
change /contrib/cube/cubedata.h so CUBE_MAX_DIM allows 128 dimensions (128 floats for the facial encodings)
Follow the directions in the 'INSTALL' file at the top directory, both for installing and for starting the server:
./configure
make
su
make install
adduser postgres
mkdir /usr/local/pgsql/data
chown postgres /usr/local/pgsql/data
su - postgres
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
***Start the server: /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
***note: use this command to switch to the postgres user on mac: $ sudo su - postgres
Now we need to add the extension. Go to the /contrib/ directory and follow the directions in the README. We can either run make all and make all install for all extensions, or navigate to /contrib/cube/ and run just: $ make $ make install for this one extension.
Now you want to go to your database and add the extension. For this I just used my GUI and ran the following: CREATE EXTENSION cube
@mmelatti what if I have postgresql already installed on my system? Should I remove it first, or install alongside the other one?
@xenc0d3r when I did it I used the uninstaller to remove the version of postgres I had. Then I downloaded the source for postgres with that link. I also changed the URL and downloaded the current 10.3 version instead of that 9.6 version.
Hello @vearutop, when we encode the photo, the values in the list are in a format like -0.09634063. How can I convert them into float type in Python to store them in a single row?
@xenc0d3r how are you encoding it: base64, or saving the array that the library returns?
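If base64 is the route, the whole encoding can be kept in a single text column by packing the floats into bytes first. A standard-library sketch with hypothetical encode_row()/decode_row() helpers (it assumes the encoding is already a plain Python list of floats, e.g. via numpy's .tolist()):

```python
import base64
import struct

def encode_row(encoding):
    # Pack the floats as little-endian doubles, then base64 for safe text storage
    raw = struct.pack(f"<{len(encoding)}d", *encoding)
    return base64.b64encode(raw).decode("ascii")

def decode_row(text):
    # Reverse the transformation: base64 -> bytes -> list of floats
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

stored = encode_row([-0.09634063, 0.12345, 0.5])
print(decode_row(stored))  # round-trips exactly
```

The trade-off is that the database can no longer index or compute distances on such a column; that is what the cube approach provides.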
@railton in vearutop's post (above) he demos how to add a threshold with min/max and return "unknown face". Do you have any links to postgresql documentation for doing something similar?
I am returning the closest match: SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1;
I could get a distance between the current face encoding and the closest match returned from the database, but I'm concerned about this distance not being a good measure if the resolution of the database entry is different from the resolution of the face encoding we are trying to match.
Can I accomplish thresholding within my query? And if not, what is the best approach with what I have returned in the Python program? Thanks!
UPDATE: I believe I've answered my own question (see below). Please feel free to leave feedback for better solutions and more info.
If you want to utilize Postgres's cube, you can use a small trick to do it without patching CUBE_MAX_DIM. You can split all points into two vectors, 64 points each. This violates the mathematical model a bit, but for the purpose of finding the closest vector it should work fine.
I made a small example of PostgreSQL and face_recognition integration: https://github.com/vearutop/face-postgre
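On the insert side, the split amounts to storing the first 64 components in one cube column and the last 64 in another. A sketch with a hypothetical cube_literal() helper and assumed column names vec_low / vec_high:

```python
def cube_literal(values):
    # Render a list of floats as a PostgreSQL cube constructor expression
    return "CUBE(array[{}])".format(",".join(str(v) for v in values))

encoding = [round(i * 0.01, 2) for i in range(128)]  # stand-in for a real encoding
insert_sql = "INSERT INTO encodings (vec_low, vec_high) VALUES ({}, {})".format(
    cube_literal(encoding[0:64]),
    cube_literal(encoding[64:128]),
)
print(insert_sql[:60])
```

In production code, prefer passing the values as bound query parameters rather than formatting them into the SQL string.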
@vearutop I am using the cube extension and it is good. But if I upload a face which is not in the database, it returns the most similar face to the uploaded image because of the LIMIT 1 parameter. Is there a way of fixing this?
@mmelatti You did exactly what I did, sorry for my delay.
Can the hash trick be used to query face images? Has anyone done this?
@xxllp you will always get slightly different vector values for the same face from different photos, hence you cannot query by equality; you can only look for the vector that is closest (by Euclidean distance).
A hash is only suitable for exact equality comparison, because slightly different vectors produce completely different hashes, so it is not relevant for this task.
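The reason in one sketch: cryptographic hashes have an avalanche effect, so two encodings that differ by 1e-8 in a single component hash to completely unrelated digests, and no hash lookup can find "nearby" vectors:

```python
import hashlib

def digest(encoding):
    # Hash a serialized encoding; any tiny numeric change alters the digest entirely
    return hashlib.sha256(repr(encoding).encode()).hexdigest()

a = [0.1, 0.2, 0.3]
b = [0.1, 0.2, 0.30000001]  # the "same" face, numerically off by 1e-8
print(digest(a) == digest(b))  # False
```

(Locality-sensitive hashing is a separate family of techniques designed for approximate nearest-neighbour search; it is not what plain hashing gives you.)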
@mmelatti I run python on windows; how do I set the cube max dim to 128 in postgresql on windows? I ran the postgresql installer and I can't find cubedata.h to edit. I set the data type of face_encoding to public.cube in pgAdmin3.
@oknoproblem3 you need to recompile PostgreSQL from source: https://www.postgresql.org/docs/10/static/install-windows-full.html
@oknoproblem3 You don't really have to recompile PostgreSQL to work around the CUBE limitation. Having your 128 points split into two vectors (64 + 64 for example), you can calculate the Euclidean distance of the whole vector from the Euclidean distances of the two sub-vectors:
query = "SELECT id, sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) as dist FROM encodings ORDER BY dist ASC LIMIT 1".format(
','.join(str(s) for s in encodings[0][0:64]),
','.join(str(s) for s in encodings[0][64:128]),
)
dist in this expression will be valid to check against the threshold of 0.6.
EDIT: fixed the array slicing to include the last elements ([0:63] -> [0:64], [64:127] -> [64:128]).
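For illustration, applying that format string to a toy encoding (a plain list of 128 floats standing in for encodings[0] from face_recognition) produces the final SQL:

```python
encodings = [[round(i * 0.001, 3) for i in range(128)]]  # stand-in encoding

query = ("SELECT id, sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + "
         "power(CUBE(array[{}]) <-> vec_high, 2)) as dist "
         "FROM encodings ORDER BY dist ASC LIMIT 1").format(
    ",".join(str(s) for s in encodings[0][0:64]),
    ",".join(str(s) for s in encodings[0][64:128]),
)
print(query[:70])
```

The returned row is a match only if its dist is below the 0.6 threshold; otherwise treat it as an unknown face.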
@vearutop following your advice to split the 128 points into 64 + 64, one thing confuses me. Since we are finding the smallest distance, why don't we use
CUBE(array[{}]) <-> vec_low + CUBE(array[{}]) <-> vec_high
instead of
sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2))
It looks like "power" then "sqrt" has no effect on the result (a bigger distance stays bigger and a smaller distance stays smaller).
The Euclidean distance between (a1,b1,c1,d1) and (a2,b2,c2,d2) is sqrt((a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2).
Mathematically, sqrt((a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2) != sqrt((a1-a2)^2+(b1-b2)^2) + sqrt((c1-c2)^2+(d1-d2)^2).
If you square the left and right parts you'll have (remember the algebraic identity (a+b)^2 = a^2 + 2ab + b^2):
(a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2 != (a1-a2)^2+(b1-b2)^2 + 2*sqrt((a1-a2)^2+(b1-b2)^2)*sqrt((c1-c2)^2+(d1-d2)^2) + (c1-c2)^2+(d1-d2)^2
Sorry for the poor math formatting :)
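The point is easy to verify numerically: recombining the two sub-distances as sqrt(d_low^2 + d_high^2) reproduces the full Euclidean distance exactly, while the plain sum d_low + d_high only gives an upper bound and cannot be compared against the 0.6 threshold:

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 1.0, 0.0]

full = dist(a, b)            # distance over all 4 components: 5.0
d_low = dist(a[:2], b[:2])   # first sub-vector
d_high = dist(a[2:], b[2:])  # second sub-vector

print(math.sqrt(d_low**2 + d_high**2))  # equals full
print(d_low + d_high)                   # strictly larger than full
```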
Do you guys have sample efficient query for MS SQL?
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea.
I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
@unoexperto Just out of curiosity, what do you store that needs more than the 128 points of a face? And thank you for your contribution.
@railton My use-case is different. I store embeddings that encode the meaning of scientific articles.
@vearutop What do you suggest: shall I go for a modified Postgres, or can we just go with the 64-point vector split?
@jayaraj if you have enough control over Postgres to build/deploy a patched version, then going for a single vector would be best in terms of simplicity and likely performance.
The vector split is a workaround for when you cannot use a patched instance (e.g. AWS RDS, or security restrictions in a company).
@unoexperto Thanks for sharing your docker image, but do you have any tutorial on how to use postgresql-large-cube? Thanks!
If you want to utilize Postgres's cube, you can use a small trick to do it without patching CUBE_MAX_DIM. You can split all points into two vectors, 64 points each. This violates the mathematical model a bit, but for the purpose of finding the closest vector it should work fine. I made a small example of PostgreSQL and face_recognition integration: https://github.com/vearutop/face-postgre
That doesn't work. It keeps throwing None as output.
"You can try this way": SELECT last_name, first_name, convert_from(face_encoding::bytea, 'utf-8') AS face_encoding FROM people
It would be really nice if you could add an example which shows how the face encodings can be stored in a database and how to efficiently query them.
Hello @leshier12 . Did you find a way of doing this?
https://www.elastic.co/blog/how-to-build-a-facial-recognition-system-using-elasticsearch-and-python This way is very efficient
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
Hello! Can I build an image for the arm64 architecture? Or can you teach me how to build it?
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
Can you tell me what's wrong? In my dockerfile I added this:
cd /usr/src/postgresql/contrib/cube \
sed -i 's/#define CUBE_MAX_DIM (100)/#define CUBE_MAX_DIM (350)/' cubedata.h; \
#
# NOTE: THIS DOCKERFILE IS GENERATED VIA "apply-templates.sh"
#
# PLEASE DO NOT EDIT IT DIRECTLY.
#
FROM alpine:3.15
# 70 is the standard uid/gid for "postgres" in Alpine
# https://git.alpinelinux.org/aports/tree/main/postgresql/postgresql.pre-install?h=3.12-stable
RUN set -eux; \
addgroup -g 70 -S postgres; \
adduser -u 70 -S -D -G postgres -H -h /var/lib/postgresql -s /bin/sh postgres; \
mkdir -p /var/lib/postgresql; \
chown -R postgres:postgres /var/lib/postgresql
# su-exec (gosu-compatible) is installed further down
# make the "en_US.UTF-8" locale so postgres will be utf-8 enabled by default
# alpine doesn't require explicit locale-file generation
ENV LANG en_US.utf8
RUN mkdir /docker-entrypoint-initdb.d
ENV PG_MAJOR 9.6
ENV PG_VERSION 9.6.24
ENV PG_SHA256 aeb7a196be3ebed1a7476ef565f39722187c108dd47da7489be9c4fcae982ace
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories
RUN set -eux; \
\
wget -O postgresql.tar.bz2 "https://ftp.postgresql.org/pub/source/v$PG_VERSION/postgresql-$PG_VERSION.tar.bz2"; \
echo "$PG_SHA256 *postgresql.tar.bz2" | sha256sum -c -; \
mkdir -p /usr/src/postgresql; \
tar \
--extract \
--file postgresql.tar.bz2 \
--directory /usr/src/postgresql \
--strip-components 1 \
; \
rm postgresql.tar.bz2; \
\
apk add --no-cache --virtual .build-deps \
bison \
coreutils \
dpkg-dev dpkg \
flex \
gcc \
krb5-dev \
libc-dev \
libedit-dev \
libxml2-dev \
libxslt-dev \
linux-headers \
make \
openldap-dev \
openssl-dev \
# configure: error: prove not found
perl-utils \
# configure: error: Perl module IPC::Run is required to run TAP tests
perl-ipc-run \
perl-dev \
python3-dev \
tcl-dev \
util-linux-dev \
zlib-dev \
; \
\
cd /usr/src/postgresql/contrib/cube \
sed -i 's/#define CUBE_MAX_DIM (100)/#define CUBE_MAX_DIM (350)/' cubedata.h; \
cd /usr/src/postgresql; \
# update "DEFAULT_PGSOCKET_DIR" to "/var/run/postgresql" (matching Debian)
# see https://anonscm.debian.org/git/pkg-postgresql/postgresql.git/tree/debian/patches/51-default-sockets-in-var.patch?id=8b539fcb3e093a521c095e70bdfa76887217b89f
awk '$1 == "#define" && $2 == "DEFAULT_PGSOCKET_DIR" && $3 == "\"/tmp\"" { $3 = "\"/var/run/postgresql\""; print; next } { print }' src/include/pg_config_manual.h > src/include/pg_config_manual.h.new; \
grep '/var/run/postgresql' src/include/pg_config_manual.h.new; \
mv src/include/pg_config_manual.h.new src/include/pg_config_manual.h; \
gnuArch="$(dpkg-architecture --query DEB_BUILD_GNU_TYPE)"; \
# explicitly update autoconf config.guess and config.sub so they support more arches/libcs
wget -O config/config.guess 'https://git.savannah.gnu.org/cgit/config.git/plain/config.guess?id=7d3d27baf8107b630586c962c057e22149653deb'; \
wget -O config/config.sub 'https://git.savannah.gnu.org/cgit/config.git/plain/config.sub?id=7d3d27baf8107b630586c962c057e22149653deb'; \
# configure options taken from:
# https://anonscm.debian.org/cgit/pkg-postgresql/postgresql.git/tree/debian/rules?h=9.5
./configure \
--build="$gnuArch" \
# "/usr/src/postgresql/src/backend/access/common/tupconvert.c:105: undefined reference to `libintl_gettext'"
# --enable-nls \
--enable-integer-datetimes \
--enable-thread-safety \
--enable-tap-tests \
# skip debugging info -- we want tiny size instead
# --enable-debug \
--disable-rpath \
--with-uuid=e2fs \
--with-gnu-ld \
--with-pgport=5432 \
--with-system-tzdata=/usr/share/zoneinfo \
--prefix=/usr/local \
--with-includes=/usr/local/include \
--with-libraries=/usr/local/lib \
--with-krb5 \
--with-gssapi \
--with-ldap \
--with-tcl \
--with-perl \
--with-python \
# --with-pam \
--with-openssl \
--with-libxml \
--with-libxslt \
; \
make -j "$(nproc)" world; \
make install-world; \
make -C contrib install; \
\
runDeps="$( \
scanelf --needed --nobanner --format '%n#p' --recursive /usr/local \
| tr ',' '\n' \
| sort -u \
| awk 'system("[ -e /usr/local/lib/" $1 " ]") == 0 { next } { print "so:" $1 }' \
# Remove plperl, plpython and pltcl dependencies by default to save image size
# To use the pl extensions, those have to be installed in a derived image
| grep -v -e perl -e python -e tcl \
)"; \
apk add --no-cache --virtual .postgresql-rundeps \
$runDeps \
bash \
su-exec \
# tzdata is optional, but only adds around 1Mb to image size and is recommended by Django documentation:
# https://docs.djangoproject.com/en/1.10/ref/databases/#optimizing-postgresql-s-configuration
tzdata \
; \
apk del --no-network .build-deps; \
cd /; \
rm -rf \
/usr/src/postgresql \
/usr/local/share/doc \
/usr/local/share/man \
; \
\
postgres --version
# make the sample config easier to munge (and "correct by default")
RUN set -eux; \
cp -v /usr/local/share/postgresql/postgresql.conf.sample /usr/local/share/postgresql/postgresql.conf.sample.orig; \
sed -ri "s!^#?(listen_addresses)\s*=\s*\S+.*!\1 = '*'!" /usr/local/share/postgresql/postgresql.conf.sample; \
grep -F "listen_addresses = '*'" /usr/local/share/postgresql/postgresql.conf.sample
RUN mkdir -p /var/run/postgresql && chown -R postgres:postgres /var/run/postgresql && chmod 2777 /var/run/postgresql
ENV PGDATA /var/lib/postgresql/data
# this 777 will be replaced by 700 at runtime (allows semi-arbitrary "--user" values)
RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 777 "$PGDATA"
VOLUME /var/lib/postgresql/data
COPY docker-entrypoint.sh /usr/local/bin/
RUN ln -s usr/local/bin/docker-entrypoint.sh / # backwards compat
ENTRYPOINT ["docker-entrypoint.sh"]
# We set the default STOPSIGNAL to SIGINT, which corresponds to what PostgreSQL
# calls "Fast Shutdown mode" wherein new connections are disallowed and any
# in-progress transactions are aborted, allowing PostgreSQL to stop cleanly and
# flush tables to disk, which is the best compromise available to avoid data
# corruption.
#
# Users who know their applications do not keep open long-lived idle connections
# may want to use a value of SIGTERM instead, which corresponds to "Smart
# Shutdown mode" in which any existing sessions are allowed to finish and the
# server stops when all sessions are terminated.
#
# See https://www.postgresql.org/docs/12/server-shutdown.html for more details
# about available PostgreSQL server shutdown signals.
#
# See also https://www.postgresql.org/docs/12/server-start.html for further
# justification of this as the default value, namely that the example (and
# shipped) systemd service files use the "Fast Shutdown mode" for service
# termination.
#
STOPSIGNAL SIGINT
#
# An additional setting that is recommended for all users regardless of this
# value is the runtime "--stop-timeout" (or your orchestrator/runtime's
# equivalent) for controlling how long to wait between sending the defined
# STOPSIGNAL and sending SIGKILL (which is likely to cause data corruption).
#
# The default in most runtimes (such as Docker) is 10 seconds, and the
# documentation at https://www.postgresql.org/docs/12/server-start.html notes
# that even 90 seconds may not be long enough in many instances.
EXPOSE 5432
CMD ["postgres"]
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
I compiled successfully, but the limit is 100