Open leshier12 opened 6 years ago
Hello leshier12, I was interested in the exact same thing. Have you found any resources for this? Does it look like this would be a possibility with this API?
What database do you use? For PostgreSQL, look at this post: https://stackoverflow.com/questions/23557537/how-to-convert-numpy-array-to-postgresql-list
I believe the approach I would like to take is storing the 128 measurements (the embedded face of each known face) in some type of database, then querying this database with a basic machine-learning classification algorithm like an SVM classifier (or kNN?) using an unknown face grabbed from an image. Any notes on how this type of database could be structured? In facerec_from_video_file.py they build an array of known faces and then call compare_faces(known_faces, face_encoding, tolerance). I'd like a system that can scale to a very large number of known faces (possibly 1 image per known face). In the end, I hope to feed my system a video stream. Thanks for any advice / insight into the performance of compare_faces()!
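For context on compare_faces() performance: the library implements it as a linear scan, computing the Euclidean distance between the unknown encoding and every known encoding and flagging those within the tolerance (0.6 by default). A minimal pure-Python sketch of that idea (the real implementation is a vectorized numpy computation; the toy 3-dimensional encodings here stand in for real 128-dimensional ones):

```python
import math

def face_distance(known_encodings, unknown):
    # Euclidean distance from the unknown encoding to each known encoding
    return [math.sqrt(sum((k - u) ** 2 for k, u in zip(enc, unknown)))
            for enc in known_encodings]

def compare_faces(known_encodings, unknown, tolerance=0.6):
    # True for each known face whose distance is within the tolerance
    return [d <= tolerance for d in face_distance(known_encodings, unknown)]

known = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
unknown = [0.1, 0.25, 0.3]
print(compare_faces(known, unknown))  # first is a match, second is not
```

Because every call touches every known encoding, the cost grows linearly with the number of known faces, which is what motivates pushing the search into a database.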
Having a table of 128 float columns (f1, f2, ..., f128), you can build a composite index on several of them (e.g. (f1, f2, f3, f4)) for selection optimization, and then query:
SELECT id, POW(f1 - :e1, 2) + POW(f2 - :e2, 2) + ... + POW(f128 - :e128, 2) AS square_distance
FROM encodings
WHERE
f1 > :minF1 AND f1 < :maxF1 AND
f2 > :minF2 AND f2 < :maxF2 AND
...
f128 > :minF128 AND f128 < :maxF128
ORDER BY square_distance ASC LIMIT 1
where
:eX = encodingX
:minFX = encodingX - 0.1 * abs(encodingX)
:maxFX = encodingX + 0.1 * abs(encodingX)
The 0.1 defines how strict the selection is; 0 is most strict.
This should bring you the row with the minimal vector distance to the searched encoding. It may also return nothing if the selection is too strict.
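The :eX / :minFX / :maxFX parameters can be generated in application code. A sketch with a hypothetical query_params() helper and a toy 4-component encoding (a real one has 128):

```python
def query_params(encoding, strictness=0.1):
    # Build the :eX, :minFX, :maxFX values for the WHERE clause above
    params = {}
    for i, e in enumerate(encoding, start=1):
        params[f"e{i}"] = e
        params[f"minF{i}"] = e - strictness * abs(e)
        params[f"maxF{i}"] = e + strictness * abs(e)
    return params

p = query_params([-0.096, 0.142, 0.051, -0.033])
print(p["minF1"], p["maxF1"])  # bounds bracketing -0.096
```

One caveat: components near zero get an almost empty [min, max] range, which is another way the selection can become too strict and return nothing.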
Fantastic vearutop. That query is exactly what I'm looking for! I needed that confirmation before moving forward. Thanks so much!
PostgreSQL has the type CUBE; use it, that will be much easier. Ex: SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1; https://www.postgresql.org/docs/10/static/cube.html Remember: "To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger." Change cubedata.h to 128 and test.
Thanks Railton! This works perfectly. I'm totally new to PostgreSQL. I've used "CREATE EXTENSION cube" to import the data type (works up to 100 dimensions). I can't find "cubedata.h" anywhere in the postgresql binary to change the value from 100 dimensions to 128. Does anyone know where to find this? Should I be importing the cube data type another way? Thanks in advance.
I can only find cube.sql files in share/postgresql/extensions
Use this container, it is already changed to work up to 350 dimensions. https://github.com/oelmekki/postgres-350d
Thanks Again :)
Adding this for anybody trying to make this type of DB on macOS, and for my own future reference when I forget how to do this and need to reinstall. (The docker solution did not work for me, so this is the manual solution that did):
Requirements:
For starters, make sure you have postgresql installed so that 'pg_config' is available: $ brew install postgresql
INSTRUCTIONS: download the source for postgresql: https://ftp.postgresql.org/pub/source/v9.6.0/postgresql-9.6.0.tar.bz2 (get the correct version number; mine is 10.3)
unzip...
change /contrib/cube/cubedata.h so CUBE_MAX_DIM allows 128 dimensions (128 floats for the facial encodings)
Follow the directions in the 'INSTALL' file at the top directory, both for installing and for starting the server:
./configure
make
su
make install
adduser postgres
mkdir /usr/local/pgsql/data
chown postgres /usr/local/pgsql/data
su - postgres
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
***Start the server: /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
***note: use this command to switch to the postgres user on mac: $ sudo su - postgres
Now we need to add the extension. Go to the /contrib/ directory and follow the directions in the README. We can either run make all and make all install for all extensions, or navigate to /contrib/cube/ and run just: $ make $ make install for this one extension.
Now you want to go to your database and add the extension. For this I just used my GUI and ran the following: CREATE EXTENSION cube
@mmelatti what if I have postgresql already installed on my system? Should I remove it first, or install alongside the other one?
@xenc0d3r when I did it I used the uninstaller to remove the version of postgres I had. Then I downloaded the source for postgres with that link. I also changed the URL and downloaded the current 10.3 version instead of that 9.6 version.
Hello @vearutop, when we encode the photo, the values in the list are in a format like -0.09634063. How can I convert them into float type in Python to store them in a single row?
@xenc0d3r how are you encoding it: base64, or saving the array that the library returns?
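If base64 is the route, the whole encoding can be kept in a single text column by packing the floats into bytes first. A standard-library sketch with hypothetical encode_row()/decode_row() helpers (it assumes the encoding is already a plain Python list of floats, e.g. via numpy's .tolist()):

```python
import base64
import struct

def encode_row(encoding):
    # Pack the floats as little-endian doubles, then base64 for safe text storage
    raw = struct.pack(f"<{len(encoding)}d", *encoding)
    return base64.b64encode(raw).decode("ascii")

def decode_row(text):
    # Reverse the transformation: base64 -> bytes -> list of floats
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

stored = encode_row([-0.09634063, 0.12345, 0.5])
print(decode_row(stored))  # round-trips exactly
```

The trade-off is that the database can no longer index or compute distances on such a column; that is what the cube approach provides.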
@railton in vearutop's post (above) he demos how to add a threshold with min/max and return "unknown face". Do you have any links to postgresql documentation for doing something similar?
I am returning the closest match: SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1;
I could get a distance between the current face encoding and the closest match returned from the database, but I'm concerned about this distance not being a good measure if the resolution of the database entry is different from the resolution of the face encoding we are trying to match.
Can I accomplish thresholding within my query? And if not, what is the best approach with what I have returned in the Python program? Thanks!
UPDATE: I believe I've answered my own question (see below). Please feel free to leave feedback for better solutions and more info.
If you want to utilize Postgres's cube, you can use a small trick to do it without patching CUBE_MAX_DIM. You can split all points into two vectors, 64 points each. This violates the mathematical model a bit, but for the purpose of finding the closest vector it should work fine.
I made a small example of PostgreSQL and face_recognition integration: https://github.com/vearutop/face-postgre
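On the insert side, the split amounts to storing the first 64 components in one cube column and the last 64 in another. A sketch with a hypothetical cube_literal() helper and assumed column names vec_low / vec_high:

```python
def cube_literal(values):
    # Render a list of floats as a PostgreSQL cube constructor expression
    return "CUBE(array[{}])".format(",".join(str(v) for v in values))

encoding = [round(i * 0.01, 2) for i in range(128)]  # stand-in for a real encoding
insert_sql = "INSERT INTO encodings (vec_low, vec_high) VALUES ({}, {})".format(
    cube_literal(encoding[0:64]),
    cube_literal(encoding[64:128]),
)
print(insert_sql[:60])
```

In production code, prefer passing the values as bound query parameters rather than formatting them into the SQL string.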
@vearutop I am using the cube extension and it is good. But if I upload a face which is not in the database, it returns the most similar face to the uploaded image because of the LIMIT 1 parameter. Is there a way of fixing this?
@mmelatti You did exactly what I did, sorry for my delay.
Can the hash trick be used to query face images? Has anyone done this?
@xxllp you will always get slightly different vector values for the same face from different photos, hence you cannot query by equality; you can only look for the vector that is closest (by Euclidean distance).
A hash is only suitable for exact equality comparison, because slightly different vectors produce completely different hashes, so it is not relevant for this task.
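The reason in one sketch: cryptographic hashes have an avalanche effect, so two encodings that differ by 1e-8 in a single component hash to completely unrelated digests, and no hash lookup can find "nearby" vectors:

```python
import hashlib

def digest(encoding):
    # Hash a serialized encoding; any tiny numeric change alters the digest entirely
    return hashlib.sha256(repr(encoding).encode()).hexdigest()

a = [0.1, 0.2, 0.3]
b = [0.1, 0.2, 0.30000001]  # the "same" face, numerically off by 1e-8
print(digest(a) == digest(b))  # False
```

(Locality-sensitive hashing is a separate family of techniques designed for approximate nearest-neighbour search; it is not what plain hashing gives you.)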
@mmelatti I run python on windows; how do I set the cube max dim to 128 in postgresql on windows? I ran the postgresql installer and I can't find cubedata.h to edit. I set the data type of face_encoding to public.cube in pgAdmin3.
@oknoproblem3 you need to recompile PostgreSQL from source: https://www.postgresql.org/docs/10/static/install-windows-full.html
@oknoproblem3 You don't really have to recompile PostgreSQL to work around the CUBE limitation. Having your 128 points split into two vectors (64 + 64 for example), you can calculate the Euclidean distance of the whole vector from the Euclidean distances of the two sub-vectors:
query = "SELECT id, sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) as dist FROM encodings ORDER BY dist ASC LIMIT 1".format(
','.join(str(s) for s in encodings[0][0:64]),
','.join(str(s) for s in encodings[0][64:128]),
)
dist in this expression will be valid to check against the threshold of 0.6.
EDIT: fixed the array slicing to include the last elements ([0:63] -> [0:64], [64:127] -> [64:128]).
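For illustration, applying that format string to a toy encoding (a plain list of 128 floats standing in for encodings[0] from face_recognition) produces the final SQL:

```python
encodings = [[round(i * 0.001, 3) for i in range(128)]]  # stand-in encoding

query = ("SELECT id, sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + "
         "power(CUBE(array[{}]) <-> vec_high, 2)) as dist "
         "FROM encodings ORDER BY dist ASC LIMIT 1").format(
    ",".join(str(s) for s in encodings[0][0:64]),
    ",".join(str(s) for s in encodings[0][64:128]),
)
print(query[:70])
```

The returned row is a match only if its dist is below the 0.6 threshold; otherwise treat it as an unknown face.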
@vearutop following your advice to split the 128 points into 64 + 64, one thing confuses me. Since we are finding the smallest distance, why don't we use
CUBE(array[{}]) <-> vec_low + CUBE(array[{}]) <-> vec_high
instead of
sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2))
It looks like "power" then "sqrt" has no effect on the result (a bigger distance stays bigger and a smaller distance stays smaller).
The Euclidean distance between (a1,b1,c1,d1) and (a2,b2,c2,d2) is sqrt((a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2).
Mathematically, sqrt((a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2) != sqrt((a1-a2)^2+(b1-b2)^2) + sqrt((c1-c2)^2+(d1-d2)^2).
If you square the left and right parts you'll have (remember the algebraic identity (a+b)^2 = a^2 + 2ab + b^2):
(a1-a2)^2+(b1-b2)^2+(c1-c2)^2+(d1-d2)^2 != (a1-a2)^2+(b1-b2)^2 + 2*sqrt((a1-a2)^2+(b1-b2)^2)*sqrt((c1-c2)^2+(d1-d2)^2) + (c1-c2)^2+(d1-d2)^2
Sorry for the poor math formatting :)
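The point is easy to verify numerically: recombining the two sub-distances as sqrt(d_low^2 + d_high^2) reproduces the full Euclidean distance exactly, while the plain sum d_low + d_high only gives an upper bound and cannot be compared against the 0.6 threshold:

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 1.0, 0.0]

full = dist(a, b)            # distance over all 4 components: 5.0
d_low = dist(a[:2], b[:2])   # first sub-vector
d_high = dist(a[2:], b[2:])  # second sub-vector

print(math.sqrt(d_low**2 + d_high**2))  # equals full
print(d_low + d_high)                   # strictly larger than full
```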
Do you guys have sample efficient query for MS SQL?
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea.
I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
@unoexperto Just out of curiosity, what do you store that needs more than the 128 points of a face? And thank you for your contribution.
@railton My use-case is different. I store embeddings that encode the meaning of scientific articles.
@vearutop What do you suggest: shall I go for a modified Postgres, or can we just go with the 64-point vector split?
@jayaraj if you have enough control over Postgres to build/deploy a patched version, then going for a single vector would be best in terms of simplicity and likely performance.
The vector split is a workaround for when you cannot use a patched instance (e.g. AWS RDS, or security restrictions in a company).
@unoexperto Thanks for sharing your docker image, but do you have any tutorial on how to use postgresql-large-cube? Thanks!
If you want to utilize Postgres's cube, you can use a small trick to do it without patching CUBE_MAX_DIM. You can split all points into two vectors, 64 points each. This violates the mathematical model a bit, but for the purpose of finding the closest vector it should work fine. I made a small example of PostgreSQL and face_recognition integration: https://github.com/vearutop/face-postgre
That doesn't work. It keeps throwing None as output.
"You can try this way": SELECT last_name, first_name, convert_from(face_encoding::bytea, 'utf-8') AS face_encoding FROM people
It would be really nice if you could add an example which shows how the face encodings can be stored in a database and how to efficiently query them.
Hello @leshier12 . Did you find a way of doing this?
https://www.elastic.co/blog/how-to-build-a-facial-recognition-system-using-elasticsearch-and-python This way is very efficient
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
Hello! Can I build an image for the arm64 architecture? Or can you teach me how to build it?
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
Can you tell me what's wrong? In my dockerfile I added this:
cd /usr/src/postgresql/contrib/cube \
sed -i 's/#define CUBE_MAX_DIM (100)/#define CUBE_MAX_DIM (350)/' cubedata.h; \
#
# NOTE: THIS DOCKERFILE IS GENERATED VIA "apply-templates.sh"
#
# PLEASE DO NOT EDIT IT DIRECTLY.
#
FROM alpine:3.15
# 70 is the standard uid/gid for "postgres" in Alpine
# https://git.alpinelinux.org/aports/tree/main/postgresql/postgresql.pre-install?h=3.12-stable
RUN set -eux; \
addgroup -g 70 -S postgres; \
adduser -u 70 -S -D -G postgres -H -h /var/lib/postgresql -s /bin/sh postgres; \
mkdir -p /var/lib/postgresql; \
chown -R postgres:postgres /var/lib/postgresql
# su-exec (gosu-compatible) is installed further down
# make the "en_US.UTF-8" locale so postgres will be utf-8 enabled by default
# alpine doesn't require explicit locale-file generation
ENV LANG en_US.utf8
RUN mkdir /docker-entrypoint-initdb.d
ENV PG_MAJOR 9.6
ENV PG_VERSION 9.6.24
ENV PG_SHA256 aeb7a196be3ebed1a7476ef565f39722187c108dd47da7489be9c4fcae982ace
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories
RUN set -eux; \
\
wget -O postgresql.tar.bz2 "https://ftp.postgresql.org/pub/source/v$PG_VERSION/postgresql-$PG_VERSION.tar.bz2"; \
echo "$PG_SHA256 *postgresql.tar.bz2" | sha256sum -c -; \
mkdir -p /usr/src/postgresql; \
tar \
--extract \
--file postgresql.tar.bz2 \
--directory /usr/src/postgresql \
--strip-components 1 \
; \
rm postgresql.tar.bz2; \
\
apk add --no-cache --virtual .build-deps \
bison \
coreutils \
dpkg-dev dpkg \
flex \
gcc \
krb5-dev \
libc-dev \
libedit-dev \
libxml2-dev \
libxslt-dev \
linux-headers \
make \
openldap-dev \
openssl-dev \
# configure: error: prove not found
perl-utils \
# configure: error: Perl module IPC::Run is required to run TAP tests
perl-ipc-run \
perl-dev \
python3-dev \
tcl-dev \
util-linux-dev \
zlib-dev \
; \
\
cd /usr/src/postgresql/contrib/cube \
sed -i 's/#define CUBE_MAX_DIM (100)/#define CUBE_MAX_DIM (350)/' cubedata.h; \
cd /usr/src/postgresql; \
# update "DEFAULT_PGSOCKET_DIR" to "/var/run/postgresql" (matching Debian)
# see https://anonscm.debian.org/git/pkg-postgresql/postgresql.git/tree/debian/patches/51-default-sockets-in-var.patch?id=8b539fcb3e093a521c095e70bdfa76887217b89f
awk '$1 == "#define" && $2 == "DEFAULT_PGSOCKET_DIR" && $3 == "\"/tmp\"" { $3 = "\"/var/run/postgresql\""; print; next } { print }' src/include/pg_config_manual.h > src/include/pg_config_manual.h.new; \
grep '/var/run/postgresql' src/include/pg_config_manual.h.new; \
mv src/include/pg_config_manual.h.new src/include/pg_config_manual.h; \
gnuArch="$(dpkg-architecture --query DEB_BUILD_GNU_TYPE)"; \
# explicitly update autoconf config.guess and config.sub so they support more arches/libcs
wget -O config/config.guess 'https://git.savannah.gnu.org/cgit/config.git/plain/config.guess?id=7d3d27baf8107b630586c962c057e22149653deb'; \
wget -O config/config.sub 'https://git.savannah.gnu.org/cgit/config.git/plain/config.sub?id=7d3d27baf8107b630586c962c057e22149653deb'; \
# configure options taken from:
# https://anonscm.debian.org/cgit/pkg-postgresql/postgresql.git/tree/debian/rules?h=9.5
./configure \
--build="$gnuArch" \
# "/usr/src/postgresql/src/backend/access/common/tupconvert.c:105: undefined reference to `libintl_gettext'"
# --enable-nls \
--enable-integer-datetimes \
--enable-thread-safety \
--enable-tap-tests \
# skip debugging info -- we want tiny size instead
# --enable-debug \
--disable-rpath \
--with-uuid=e2fs \
--with-gnu-ld \
--with-pgport=5432 \
--with-system-tzdata=/usr/share/zoneinfo \
--prefix=/usr/local \
--with-includes=/usr/local/include \
--with-libraries=/usr/local/lib \
--with-krb5 \
--with-gssapi \
--with-ldap \
--with-tcl \
--with-perl \
--with-python \
# --with-pam \
--with-openssl \
--with-libxml \
--with-libxslt \
; \
make -j "$(nproc)" world; \
make install-world; \
make -C contrib install; \
\
runDeps="$( \
scanelf --needed --nobanner --format '%n#p' --recursive /usr/local \
| tr ',' '\n' \
| sort -u \
| awk 'system("[ -e /usr/local/lib/" $1 " ]") == 0 { next } { print "so:" $1 }' \
# Remove plperl, plpython and pltcl dependencies by default to save image size
# To use the pl extensions, those have to be installed in a derived image
| grep -v -e perl -e python -e tcl \
)"; \
apk add --no-cache --virtual .postgresql-rundeps \
$runDeps \
bash \
su-exec \
# tzdata is optional, but only adds around 1Mb to image size and is recommended by Django documentation:
# https://docs.djangoproject.com/en/1.10/ref/databases/#optimizing-postgresql-s-configuration
tzdata \
; \
apk del --no-network .build-deps; \
cd /; \
rm -rf \
/usr/src/postgresql \
/usr/local/share/doc \
/usr/local/share/man \
; \
\
postgres --version
# make the sample config easier to munge (and "correct by default")
RUN set -eux; \
cp -v /usr/local/share/postgresql/postgresql.conf.sample /usr/local/share/postgresql/postgresql.conf.sample.orig; \
sed -ri "s!^#?(listen_addresses)\s*=\s*\S+.*!\1 = '*'!" /usr/local/share/postgresql/postgresql.conf.sample; \
grep -F "listen_addresses = '*'" /usr/local/share/postgresql/postgresql.conf.sample
RUN mkdir -p /var/run/postgresql && chown -R postgres:postgres /var/run/postgresql && chmod 2777 /var/run/postgresql
ENV PGDATA /var/lib/postgresql/data
# this 777 will be replaced by 700 at runtime (allows semi-arbitrary "--user" values)
RUN mkdir -p "$PGDATA" && chown -R postgres:postgres "$PGDATA" && chmod 777 "$PGDATA"
VOLUME /var/lib/postgresql/data
COPY docker-entrypoint.sh /usr/local/bin/
RUN ln -s usr/local/bin/docker-entrypoint.sh / # backwards compat
ENTRYPOINT ["docker-entrypoint.sh"]
# We set the default STOPSIGNAL to SIGINT, which corresponds to what PostgreSQL
# calls "Fast Shutdown mode" wherein new connections are disallowed and any
# in-progress transactions are aborted, allowing PostgreSQL to stop cleanly and
# flush tables to disk, which is the best compromise available to avoid data
# corruption.
#
# Users who know their applications do not keep open long-lived idle connections
# may want to use a value of SIGTERM instead, which corresponds to "Smart
# Shutdown mode" in which any existing sessions are allowed to finish and the
# server stops when all sessions are terminated.
#
# See https://www.postgresql.org/docs/12/server-shutdown.html for more details
# about available PostgreSQL server shutdown signals.
#
# See also https://www.postgresql.org/docs/12/server-start.html for further
# justification of this as the default value, namely that the example (and
# shipped) systemd service files use the "Fast Shutdown mode" for service
# termination.
#
STOPSIGNAL SIGINT
#
# An additional setting that is recommended for all users regardless of this
# value is the runtime "--stop-timeout" (or your orchestrator/runtime's
# equivalent) for controlling how long to wait between sending the defined
# STOPSIGNAL and sending SIGKILL (which is likely to cause data corruption).
#
# The default in most runtimes (such as Docker) is 10 seconds, and the
# documentation at https://www.postgresql.org/docs/12/server-start.html notes
# that even 90 seconds may not be long enough in many instances.
EXPOSE 5432
CMD ["postgres"]
What an excellent thread, guys! @railton thank you for pointing out the cube type. I had no idea. I needed to store text embeddings, so my vector is 512 items. For those who need PSQL with a cube that supports more than 100 items, I created docker images with the patched extension limit set to 2048. I created builds for 10.7 and 11.2. Feel free to use them:
https://hub.docker.com/r/expert/postgresql-large-cube/tags https://github.com/unoexperto/docker-postgresql-large-cube
I compiled successfully, but the limit is 100