csirtgadgets / massive-octo-spice

DEPRECATED - USE v3 (bearded-avenger)
https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki
GNU Lesser General Public License v3.0

migrate bug #405

Closed villain closed 8 years ago

villain commented 8 years ago

Can't re-open the previous issue, so creating a new one as suggested.

Yep, still having the problem. Just did another git pull and I'm getting the same error. I'm migrating from a v1 instance.

[2016-04-26T08:42:36,147Z][12427][INFO]: staring up..
[2016-04-26T08:42:36,148Z][12427][INFO]: starting up ES connection...
[2016-04-26T08:42:36,149Z][12427][INFO]: checking journal: /tmp/cif-migrate.journal
[2016-04-26T08:42:36,149Z][12427][INFO]: creating threads...
[2016-04-26T08:42:36,438Z][12427][INFO]: starting workers
[2016-04-26T10:14:57,394Z][12427][INFO]: starting writer thread...
Subroutine CIF::Legacy::Archive::db_Main redefined at /usr/local/share/perl/5.18.2/Ima/DBI.pm line 278.
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at bin/migrate-data.pl line 282.

Segmentation fault (core dumped)

It was working OK until the more recent changes to the migrate script.

giovino commented 8 years ago

@villain

Can you download this migrate-data.pl script and run it with the debug flag (-d)?

$ wget https://raw.githubusercontent.com/giovino/massive-octo-spice/develop/v1migration/bin/migrate-data.pl

You can see the changes here.

The changes add some verbose logging (print statements) and attempt to catch some encoding and decoding errors.

giovino commented 8 years ago

It's not clear to me these errors are related to the initial error we are trying to debug. Is it possible the system was left in a poor state? Does restarting and then trying the migrate script again produce different results?

villain commented 8 years ago

^ Yeah, that was from a box I've been troubleshooting on. Just re-running the query on the proper host now.

villain commented 8 years ago

Hopefully this looks like what you're expecting:

[2016-04-29T11:52:55,113Z][1821][INFO]: staring up..
[2016-04-29T11:52:55,115Z][1821][INFO]: starting up ES connection...
[2016-04-29T11:52:55,116Z][1821][INFO]: checking journal: /tmp/cif-migrate.journal
[2016-04-29T11:52:55,135Z][1821][INFO]: creating threads...
[2016-04-29T11:52:55,823Z][1821][INFO]: starting workers

[2016-04-29T13:16:05,729Z][1821][INFO]: starting writer thread...
Subroutine CIF::Legacy::Archive::db_Main redefined at /usr/local/share/perl/5.18.2/Ima/DBI.pm line 278.
$VAR1 = {
  'id' => 4544,
  'uuid' => '1b9dc4b0-0bec-4a98-b091-ecb4decfbcf8',
  'data' => '0ALwVg0AAIA/Er4CCjUKC2Nyb3dkc3RyaWtlGAMiJDFiOWRjNGIwLTBiZWMtNGE5OC1iMDkxLWVj YjRkZWNmYmNmOCIUMjAxMy0wNy0yMlQwNTozNDozNlo6FE4WALhCOQoCRU4SM3Bvc3NpYmx5IG1h bGljaW91cyBkeW5hbWljIGRucyBkb21haW4gKB2OOClKIgoXCA0aAkVOKAMyDQVHTAdtYWx3YXJl KgcIBBUAAL5CWhgKDRrwe3Vua25vd25CA1VUQ1gDYAViIEIeChwKGBIWCAQSBGZxZG4qDHNxdWly bHkuaW5mb1gCcjkSBHV1aWQaCWd1aWQgaGFzaCARSiQ4Yzg2NDMwNi1kMjFhLTM3YjEtODcwNS03 NDZhNzg2NzE5YmZ4ApABAxoEMC4wMSICRU4= ',
  'guid' => '8c864306-d21a-37b1-8705-746a786719bf'
};
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at bin/migrate-data-debug.pl line 286.

Segmentation fault (core dumped)

giovino commented 8 years ago

@villain

We're getting closer, try this version:

wget https://raw.githubusercontent.com/giovino/massive-octo-spice/develop/v1migration/bin/migrate-data.pl

All changes can be seen here

villain commented 8 years ago

Seems like it's back to the original output?

[2016-04-30T12:08:42,339Z][5735][INFO]: staring up..
[2016-04-30T12:08:42,340Z][5735][INFO]: starting up ES connection...
[2016-04-30T12:08:42,340Z][5735][INFO]: checking journal: /tmp/cif-migrate.journal
[2016-04-30T12:08:42,341Z][5735][INFO]: creating threads...
[2016-04-30T12:08:42,588Z][5735][INFO]: starting workers
[2016-04-30T13:28:07,014Z][5735][INFO]: starting writer thread...
Subroutine CIF::Legacy::Archive::db_Main redefined at /usr/local/share/perl/5.18.2/Ima/DBI.pm line 278.
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at bin/migrate-data-debug.pl line 286.

Segmentation fault (core dumped)

giovino commented 8 years ago

@villain

Is it possible that the root cause of the segmentation fault is the host running out of memory, or do you know that it is a data structure parsing error?

q1: How much memory does the host running this script have?
q2: Have you monitored memory usage leading up to the seg fault?
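
For q2, one way to watch memory leading up to the crash is to sample the resident set size of the migration process at a fixed interval. The following is only a minimal sketch, assuming a Linux host with `/proc`; the function name `log_rss` and the log file name are hypothetical and are not part of the migration tooling.

```shell
#!/bin/sh
# log_rss PID [INTERVAL_SECONDS]
# Hypothetical helper: prints "<unix-timestamp> <VmRSS in kB>" for PID every
# INTERVAL seconds until the process is gone. Linux-only (reads /proc).
log_rss() {
    pid=$1
    interval=${2:-5}
    while [ -r "/proc/$pid/status" ]; do
        # VmRSS is the resident set size reported by the kernel, in kB.
        rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null) || break
        # An exited (zombie) process has no VmRSS line, so stop there.
        [ -n "$rss_kb" ] || break
        echo "$(date +%s) $rss_kb"
        sleep "$interval"
    done
}

# Example (assumption: the PID is the one shown in the script's log prefix,
# e.g. 23928):
#   log_rss 23928 10 > migrate-rss.log
```

If the last samples before the segfault show steady growth, that points toward memory exhaustion rather than a parsing error.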

giovino commented 8 years ago

@villain

And are you running this with the debug flag (-d)?

villain commented 8 years ago

Not running out of memory from what I can tell. It has 32GB available and never drops below 4GB free.

re-run of the debug output:

[2016-05-05T09:25:35,527Z][23928][INFO][main:136]: staring up..
[2016-05-05T09:25:35,527Z][23928][INFO][main:149]: starting up ES connection...
[2016-05-05T09:25:35,527Z][23928][INFO][main:156]: checking journal: /tmp/cif-migrate.journal
[2016-05-05T09:25:35,528Z][23928][INFO][main:159]: creating threads...
[2016-05-05T09:25:35,735Z][23928][INFO][main:214]: starting workers
[2016-05-05T09:25:35,736Z][23928][DEBUG][main:224]: connecting to archive..

[2016-05-05T14:12:18,398Z][23928][DEBUG][main:253]: total count: 290774165
[2016-05-05T14:12:18,398Z][23928][DEBUG][main:254]: pages: 58155
[2016-05-05T14:12:18,509Z][23928][DEBUG][main:261]: sending ctrl warm-up msg...
[2016-05-05T14:12:18,723Z][23928][DEBUG][main:267]: creating 8 worker threads...
[2016-05-05T14:12:18,723Z][23928][INFO][main:333]: starting writer thread...
Subroutine CIF::Legacy::Archive::db_Main redefined at /usr/local/share/perl/5.18.2/Ima/DBI.pm line 278.
[2016-05-05T14:12:18,875Z][23928][DEBUG][main:401]: starting worker: 3
[2016-05-05T14:12:19,011Z][23928][DEBUG][main:401]: starting worker: 4
[2016-05-05T14:12:19,120Z][23928][DEBUG][main:401]: starting worker: 5
[2016-05-05T14:12:19,269Z][23928][DEBUG][main:401]: starting worker: 6
[2016-05-05T14:12:19,420Z][23928][DEBUG][main:401]: starting worker: 7
[2016-05-05T14:12:19,585Z][23928][DEBUG][main:401]: starting worker: 8
[2016-05-05T14:12:19,723Z][23928][DEBUG][main:401]: starting worker: 9
[2016-05-05T14:12:19,832Z][23928][DEBUG][main:401]: starting worker: 10
[2016-05-05T14:12:19,832Z][23928][DEBUG][main:274]: executing sql...
[2016-05-05T14:12:20,341Z][23928][DEBUG][main:280]: sending next pages to workers...
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at bin/migrate-data-debug.pl line 286.

Segmentation fault (core dumped)

giovino commented 8 years ago

@villain, I apologize for the tardiness of this response; I thought I had responded already. Our guess is that, with a total count of 290,774,165 records, you are hitting a known memory leak. We've seen segfaults ourselves in a similarly large migration.

As it stands today, because the migration script uses a journal (i.e. it knows what has and hasn't been migrated), our recommendation is to:

  1. Stand up a CIFv2 server
  2. Replicate all the data being ingested into CIFv1 into the CIFv2 box
  3. Stop data being ingested into CIFv1
  4. Run the migration script; if/when it segfaults, restart the script until the migration has completed.
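
Step 4 can be automated with a small wrapper that re-runs the script until it exits cleanly. This is a minimal sketch, not part of the project: the function name `run_until_done` is hypothetical, and it assumes the migration script exits with status 0 once everything has been migrated and that the journal (`/tmp/cif-migrate.journal` in the logs above) lets each restart resume where the last run stopped. A process killed by a segmentation fault is normally reported by the shell as exit status 139.

```shell
#!/bin/sh
# run_until_done MAX_ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or MAX_ATTEMPTS
# is reached. Returns 0 on success, otherwise the last exit status.
run_until_done() {
    max_attempts=$1
    shift
    attempt=1
    status=1
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "attempt $attempt of $max_attempts: $*" >&2
        # Capture the exit status without aborting under `set -e`.
        "$@" && status=0 || status=$?
        if [ "$status" -eq 0 ]; then
            echo "completed on attempt $attempt" >&2
            return 0
        fi
        # 139 = 128 + SIGSEGV (11), i.e. the segfault seen in this issue.
        echo "exited with status $status, restarting" >&2
        attempt=$((attempt + 1))
    done
    return "$status"
}

# Example (assumption: the script is invoked as `perl bin/migrate-data.pl`
# from the v1migration directory):
#   run_until_done 50 perl bin/migrate-data.pl
```

The attempt cap is there so that a failure unrelated to the memory leak, such as the `hash- or arrayref expected` error, does not loop forever.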