Huge trajectory files can not be processed

chraibi commented 5 years ago

In Gitlab by @anna-braun on Mar 1, 2019, 10:56 [origin]

I wanted to do some calculations with jpsreport, but generated a segmentation fault:
[image: segmentation_fault]

The trajectory file has about 2GB.
You can find the files (trajectory, report, log) here:
https://fz-juelich.sciebo.de/s/6BJJFC84ugUlm8r

chraibi commented 5 years ago

In Gitlab by @anna-braun on Mar 2, 2019, 20:16

changed the description

chraibi commented 5 years ago

In Gitlab by @chraibi on Mar 3, 2019, 12:21

@anna-braun can you run the following code with your file?

chraibi commented 5 years ago

In Gitlab by @chraibi on Mar 3, 2019, 19:40

created branch 100-huge-trajectory-files to address this issue

chraibi commented 5 years ago

In Gitlab by @chraibi on Mar 3, 2019, 19:51

@anna-braun is this a txt or an xml file? :thinking:

chraibi commented 5 years ago

In Gitlab by @anna-braun on Mar 4, 2019, 08:16

It is txt, but it does not work for xml either.

gjaeger commented 5 years ago

@anna-braun What's the status?

anna-braun commented 5 years ago

I guess it still does not work with huge trajectories. We solved the problem by splitting the trajectory-file every 10 MB (in jpscore).

gjaeger commented 5 years ago

@anna-braun Can we close the issue?

chraibi commented 5 years ago

No, please keep it open. jpsreport reads the whole file at once, which is not good for big files of several GB.

gjaeger commented 5 years ago

@anna-braun @chraibi: Which steps are planned? What suggestions are there for resolving the issue?

chraibi commented 5 years ago

I think a big trajectory should be read chunk-wise, not all at once.

Something like this.

gjaeger commented 5 years ago

Sounds good. Who has capacity?

Can we realize this for version 8.4?

chraibi commented 5 years ago

Probably not. It depends on our capacity (I can't right now) and on whether it's acute.

mirakuepper commented 5 years ago

Actually, I have exactly the same problem with my trajectory files. My trajectory files (.txt) range between 200 MB and 500 MB.
To complete the analysis with jpsreport I currently cut the files after 10,000 frames. This results in data files of about 10-40 MB; otherwise jpsreport chokes on the data and generates a segmentation fault. So instead of running jpsreport once on the original data file, I have to run it up to 40 times on the 'cut' data. Any solution to this issue would be greatly appreciated.
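
The manual cutting described above could be scripted roughly as follows. This is a minimal sketch, not part of jpsreport or jpscore: the function name `split_by_frame`, the output naming (`<stem>_01.txt`, ...) and the assumption that the frame number is the second whitespace-separated column (as in the `# PersID Frame x/m y/m z/m` header shown in this thread) are all illustrative choices.

```python
from pathlib import Path


def split_by_frame(src, out_dir, frames_per_part=10_000, frame_col=1):
    """Split a txt trajectory into parts of `frames_per_part` frames each.

    The input is streamed line by line, so it is never held in memory
    as a whole. Comment lines (starting with '#') that appear before the
    first data row are copied to the top of every part; later comment
    lines are skipped. Assumes the frame number sits in column
    `frame_col` (0-based), which is an assumption about the file layout.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    header = []   # leading comment lines
    parts = {}    # part index -> (path, open file handle)
    seen_data = False
    try:
        with src.open() as fin:
            for line in fin:
                if not line.strip():
                    continue
                if line.startswith("#"):
                    if not seen_data:
                        header.append(line)
                    continue
                seen_data = True
                part = int(line.split()[frame_col]) // frames_per_part
                if part not in parts:
                    path = out_dir / f"{src.stem}_{part + 1:02d}{src.suffix}"
                    handle = path.open("w")
                    handle.writelines(header)
                    parts[part] = (path, handle)
                parts[part][1].write(line)
    finally:
        for _, handle in parts.values():
            handle.close()

    # Paths of the written parts, ordered by frame range.
    return [str(parts[k][0]) for k in sorted(parts)]


# Hypothetical usage:
# split_by_frame("traj.txt", "parts", frames_per_part=10_000)
```

One output file is kept open per part, so rows do not need to be sorted by frame. For inputs that would produce a very large number of parts, the number of simultaneously open files could become a limit.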

gjaeger commented 5 years ago

You can read the trajectory-files at the same time:

  <trajectories format="txt" unit="m">
    <file name="traj_01.txt" />
    <file name="traj_02.txt" />
    <file name="traj_03.txt" />
    <file name="traj_04.txt" />
    <file name="traj_05.txt" />
    <file name="traj_06.txt" />
    <file name="traj_07.txt" />
    <file name="traj_08.txt" />
    <file name="traj_09.txt" />
    <file name="traj_10.txt" />
    <file name="traj_11.txt" />
    <file name="traj_12.txt" />
    <file name="traj_13.txt" />
    <file name="traj_14.txt" />
    <file name="traj_15.txt" />
    <file name="traj_16.txt" />
    <file name="traj_17.txt" />
    <file name="traj_18.txt" />
    <file name="traj_19.txt" />
    <file name="traj_20.txt" />
    <file name="traj_21.txt" />
    <file name="traj_22.txt" />
    <file name="traj_23.txt" />
    <file name="traj_24.txt" />
    <file name="traj_25.txt" />
    <file name="traj_26.txt" />
    <file name="traj_27.txt" />
    <file name="traj_28.txt" />
    <file name="traj_29.txt" />
    <file name="traj_30.txt" />
    <file name="traj_31.txt" />
    <file name="traj_32.txt" />
    <file name="traj_33.txt" />
    <file name="traj_34.txt" />
    <file name="traj_35.txt" />
    <file name="traj_36.txt" />
    <file name="traj_37.txt" />
    <file name="traj_38.txt" />
    <file name="traj_39.txt" />
    <file name="traj_40.txt" />
    <file name="traj_41.txt" />
  </trajectories>

instead of

  <trajectories format="txt" unit="m">
    <file name="traj_01.txt" />
  </trajectories>
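
To avoid typing the file list by hand, the `<trajectories>` element could be generated. The sketch below only uses the element and attribute names shown above (`trajectories`, `format`, `unit`, `file`, `name`); the helper name `trajectories_element` is made up for illustration, and the result still has to be pasted into the jpsreport ini file manually.

```python
import xml.etree.ElementTree as ET


def trajectories_element(file_names, fmt="txt", unit="m"):
    """Return a <trajectories> element listing one <file> per name."""
    root = ET.Element("trajectories", {"format": fmt, "unit": unit})
    for name in file_names:
        ET.SubElement(root, "file", {"name": name})
    ET.indent(root, space="  ")  # pretty-printing; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# The 41 file names from the example above.
names = [f"traj_{i:02d}.txt" for i in range(1, 42)]
snippet = trajectories_element(names)
print(snippet)
```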

mirakuepper commented 5 years ago

@gjaeger Thank you. This works for me!

gjaeger commented 5 years ago

This raises the question of how the speed is calculated for shared input data.

For comparison:

trajectory for one agent in one file:

# PersID    Frame   x/m y/m z/m
1   0   0.8480  5.0880  0.0000
1   1   1.4850  5.0880  0.0000
...
1   1537    999.8850    5.0880  0.0000

[image: traj_file_total]

based on IFD_I-file:

#Frame  PersID  x/m y/m z/m Individual density(m^(-2))  Individual velocity(m/s)    
00000    1   0.8480  5.0880  0.0000  1.0000  1.2740
00001    1   1.4850  5.0880  0.0000  1.0000  1.2870
00002    1   2.1350  5.0880  0.0000  1.0000  1.3000
..

The v(t)-diagram for the first frames (frame 1 to 767):

# PersID    Frame   x/m y/m z/m
1   0   0.8480  5.0880  0.0000
1   1   1.4850  5.0880  0.0000
1   2   2.1350  5.0880  0.0000
1   3   2.7850  5.0880  0.0000
...
1   766 498.7350    5.0880  0.0000
1   767 499.3850    5.0880  0.0000

[image: traj_file_part_a]

based on:

#Frame  PersID  x/m y/m z/m Individual density(m^(-2))  Individual velocity(m/s)    
00000    1   0.8480  5.0880  0.0000  1.0000  1.2740
00001    1   1.4850  5.0880  0.0000  1.0000  1.2870
00002    1   2.1350  5.0880  0.0000  1.0000  1.3000

The v(t)-diagram for the second part (frame 768 to 1537):

# PersID    Frame   x/m y/m z/m
1   768 500.0350    5.0880  0.0000
1   769 500.6850    5.0880  0.0000
1   770 501.3350    5.0880  0.0000
...
1   1536    999.2350    5.0880  0.0000
1   1537    999.8850    5.0880  0.0000

[image: traj_file_part_b]

based on:

#Frame  PersID  x/m y/m z/m Individual density(m^(-2))  Individual velocity(m/s)
00768    1   500.0350    5.0880  0.0000  1.0000  1.3000
00769    1   500.6850    5.0880  0.0000  1.0000  1.3000
00770    1   501.3350    5.0880  0.0000  1.0000  1.3000

I would have expected the speed in the second part to increase at the beginning as well.

@chraibi Does jpsreport have a memory function?
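
As a worked check of the numbers above, the sketch below applies plain finite differences to the x-positions posted in this comment. Both the frame rate of 2 fps and the difference scheme (one-sided at the ends, central elsewhere) are assumptions: neither is stated in this thread and neither is taken from the jpsreport source. They are used here only because they reproduce the posted velocities.

```python
def speeds(xs, fps):
    """Speed along x by finite differences.

    One-sided difference at the first and last sample, central
    difference in between. Illustrative only; not jpsreport's code.
    """
    if len(xs) < 2:
        raise ValueError("need at least two positions")
    dt = 1.0 / fps
    last = len(xs) - 1
    result = []
    for i in range(len(xs)):
        lo = max(i - 1, 0)
        hi = min(i + 1, last)
        result.append((xs[hi] - xs[lo]) / ((hi - lo) * dt))
    return result


FPS = 2  # assumption, not stated in the thread

# x-positions copied from the listings above.
part_a = [0.8480, 1.4850, 2.1350, 2.7850]   # frames 0..3
part_b = [500.0350, 500.6850, 501.3350]     # frames 768..770

v_a = [round(v, 4) for v in speeds(part_a, FPS)]
v_b = [round(v, 4) for v in speeds(part_b, FPS)]
print(v_a)
print(v_b)
```

Under these assumptions the first file gives 1.274, 1.287 and 1.3 m/s for frames 0 to 2, and the second file gives 1.3 m/s from its first frame on. In the posted data the first step is 0.637 m and every later step is 0.650 m, including the first step of the second file, so a constant speed at the start of the second part is what this scheme would produce even if the file is processed on its own.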

gjaeger commented 5 years ago

@schroedtert

$ lldb /Users/gjaeger/Documents/hubs/JuPedSim_github/jpsreport/bin/jpsreport ini_2019.xml

(lldb) target create "/Users/gjaeger/Documents/hubs/JuPedSim_github/jpsreport/bin/jpsreport"
Current executable set to '/Users/gjaeger/Documents/hubs/JuPedSim_github/jpsreport/bin/jpsreport' (x86_64).
(lldb) settings set -- target.run-args  "ini_2019.xml"
(lldb) run
Process 8588 launched: '/Users/gjaeger/Documents/hubs/JuPedSim_github/jpsreport/bin/jpsreport' (x86_64)
----
JuPedSim - JPSreport

Current date   : Thu Sep 05 10:49:51 2019
Version        : 0.8.4
Compiler       : g++ (8.3.0)
Commit hash    : v0.8.3-119-gbc8ced4
Commit date    : Wed Aug 14 03:47:13 2019
Branch         : develop
Python         : /opt/local/bin/python3.6 (3.6.9)
----

INFO:   Parsing the ini file <ini_2019.xml>
INFO:   logfile </Users/gjaeger/Documents/Simulationen/Mira/log.txt>
lineNr 100000
...
lineNr 10400000
Process 8588 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGKILL
    frame #0: 0x00007fff6e4c5f49 libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 41
libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell:
->  0x7fff6e4c5f49 <+41>: rep    movsb  (%rsi), %es:(%rdi)
    0x7fff6e4c5f4b <+43>: popq   %rbp
    0x7fff6e4c5f4c <+44>: retq   
    0x7fff6e4c5f4d <+45>: cmpq   %rdi, %rsi
Target 0: (jpsreport) stopped.

chraibi commented 5 years ago

> You can read the trajectory-files at the same time: [...]

I think it's enough to specify a directory. No need to write all the names of files one by one.

chraibi commented 5 years ago

> Actually, I have exactly the same problem with my trajectory files. My trajectory files (.txt) range between 200 MB and 500 MB. To complete the analysis with jpsreport I currently cut the files after 10,000 frames. This results in data files of about 10-40 MB; otherwise jpsreport chokes on the data and generates a segmentation fault. So instead of running jpsreport once on the original data file, I have to run it up to 40 times on the 'cut' data. Any solution to this issue would be greatly appreciated.

There is a solution. jpsreport reads each file and stores its entire content at once, which is not a good idea for large files.

This needs to be changed so that jpsreport reads the data in chunks and processes them one chunk at a time. See also here for some ideas.

Anyone willing to tackle this is welcome to contribute.
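
For anyone picking this up, the chunk-by-chunk idea can be sketched as below. This is an illustration of the general technique in Python, not a patch for jpsreport (which is written in C++). The names `read_in_chunks` and `frame_range` are hypothetical, and the column layout (`PersID Frame x y z`, with `#` comment lines) is assumed from the listings in this thread.

```python
from itertools import islice


def read_in_chunks(path, lines_per_chunk=100_000):
    """Yield lists of split data rows, at most `lines_per_chunk` per list.

    Only one chunk is held in memory at a time; comment lines
    (starting with '#') and blank lines are skipped.
    """
    with open(path) as fin:
        rows = (
            line.split()
            for line in fin
            if line.strip() and not line.startswith("#")
        )
        while True:
            chunk = list(islice(rows, lines_per_chunk))
            if not chunk:
                return
            yield chunk


def frame_range(path, lines_per_chunk=100_000):
    """Example consumer: smallest and largest frame number in the file."""
    lo = hi = None
    for chunk in read_in_chunks(path, lines_per_chunk):
        frames = [int(row[1]) for row in chunk]  # assumes Frame is column 1
        lo = min(frames) if lo is None else min(lo, min(frames))
        hi = max(frames) if hi is None else max(hi, max(frames))
    return lo, hi
```

A running minimum and maximum is easy to compute this way because each row can be handled on its own. Measurements that look at neighbouring frames would additionally need chunks that overlap or some state carried from one chunk to the next.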

chraibi commented 5 years ago

> The question arises to me how the speed of shared input data is calculated. @chraibi Does jpsreport have a memory function?

I don't know what you mean by shared input, but I don't think that jpsreport has a "memory function".

gjaeger commented 5 years ago

@chraibi By shared input I mean the division into individual files. I wonder if the shared trajectory files are read independently? Otherwise I cannot explain to myself that the calculation of the velocity/speed takes place without loss of the knowledge of previous frames. See my example above.

chraibi commented 5 years ago

I think jpsreport reads one file at a time, processes it, then writes the results out. It then repeats this procedure for the other files, each independently of the others.

What @mirakuepper is doing is a smart hack, but of course it only works if you are OK with the resulting discontinuities in the results.

gjaeger commented 5 years ago

> I think jpsreport reads one file at once, processes it then writes the results out. Then starts again this mechanics for other files, independently from each other.

If each file were processed independently, I would expect the speed in the second part to increase at the beginning as well (see the v(t)-diagram for the second part), as it does in the first part (see the v(t)-diagram for the first part). This is not the case: the analysis shows no discontinuities in the movement speed.

chraibi commented 4 years ago

Brainstorming:
https://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c
https://stackoverflow.com/questions/34751873/how-to-read-huge-file-in-c
https://www.reddit.com/r/cpp/comments/318m4n/how_to_read_a_huge_file_fast/

JuPedSim / jpsreport

Huge trajectory files can not be processed #203