DustinSokolowski commented 3 years ago

Thank you for this great tool.

I am running the juicer.sh pipeline using the PBS version and I am on the deduplication step. It submits about 100 jobs, and 99 of them finish in ~10 hours. One final job, however, has been running for 130+ hours and shows no signs of stopping. I looked for previous reports of this type of error, and the closest fix I saw was to add a "j" flag (https://github.com/aidenlab/juicer/issues/169). When I tried to reproduce this, I found that -j is now an illegal flag. Is it possible that this flag was not carried over to the PBS version (or is no longer needed)? Otherwise, is there a way to speed up the deduplication step, or to cancel this single job and let the remaining 99+% of the Hi-C data deduplicate and align?

Thanks so much, Dustin

nchernia commented 3 years ago

It probably never got it in the first place.

I would kill that job (make sure you keep track of which split it’s running on).

Then use the script below to remove dups on just that file. Then concatenate all the merged_nodups, dups, and opt_dups files (via cat) into the three big files, and relaunch juicer at -S final.

    # Usage:
    #   awk -v name="<prefix>" -f rmdups.awk <infile>
    # Reads infile, writes two files, <prefix>merged_nodups.txt and
    # <prefix>dups.txt, where the duplicates are stored in the dups file.

    # absolute value
    function abs(v) {
        return v<0?-v:v;
    }

    BEGIN {
        dupname=name"dups.txt";
        nodupname=name"merged_nodups.txt";
    }
    # for every line
    {
        # if the first eight fields all match the previous line exactly
        # it's a dup, otherwise nodup
        if ($1!=p1 || $2 != p2 || $3 != p3 || $4 != p4 || $5 != p5 || $6 != p6 || $7 != p7 || $8 != p8){
            print > nodupname
        }
        else {
            print > dupname
        }
    }
    # assign previous whether dup or nodup
    {
        p1=$1;p2=$2;p3=$3;p4=$4;p5=$5;p6=$6;p7=$7;p8=$8
    }
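A minimal sketch of the concatenation step described above, assuming the per-split files follow the <prefix>_msplitNNNN_<kind>.txt naming seen elsewhere in this thread. The function name concat_splits and the optdups suffix for the per-split optical-duplicate files are assumptions, not taken from the juicer scripts; adjust the patterns to match what is actually in your output directory.

```shell
# Sketch: merge per-split dedup outputs into the three combined files.
# Assumed per-split naming: <prefix>_msplitNNNN_{merged_nodups,dups,optdups}.txt
concat_splits() {
    dir=$1
    for kind in merged_nodups dups optdups; do
        case $kind in
            optdups) out="$dir/opt_dups.txt" ;;
            *)       out="$dir/$kind.txt" ;;
        esac
        # The glob expands in sorted order, so zero-padded splits
        # are concatenated in their original order.
        set -- "$dir"/*_msplit*_"$kind".txt
        if [ -e "$1" ]; then
            cat "$@" > "$out"
        else
            : > "$out"   # no split files of this kind: create an empty file
        fi
    done
}
```

Called as concat_splits followed by the output directory, this would be run once the stuck split has been deduplicated separately, before relaunching juicer at -S final.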


-- Neva Cherniavsky Durand, Ph.D. | she, her, hers Assistant Professor | Molecular and Human Genetics Aiden Lab | Baylor College of Medicine www.aidenlab.org

DustinSokolowski commented 3 years ago

Hey!

Thank you so much for the quick reply. I'll let you know how it works.

Best! Dustin

DustinSokolowski commented 3 years ago

Hey!

I'm sorry to message twice. I noticed that at this stage I have _dups.txt and _nodups.txt files but no opt_dups.txt files yet.

Looking into "split_rmdups.awk", it looks like the opt_dups.txt files should be created at the final step of deduplication, after the _dups.txt and _nodups.txt files are made? Since the _dups.txt and _nodups.txt files didn't get the "all ok" (because one split got stuck), I assume the opt_dups files couldn't be generated.

Do you know if there is any workaround for this?

nchernia commented 3 years ago

No, they should be created at the same time. They might not be there if the program is unable to determine opt dups (depends on readname).

I would concat the dups and merged_nodups and count the lines and see if it adds up to the number of lines in merged_sort (or you could also try ls -l on all 3 and see if it adds up)
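One way to do the line-count comparison suggested above, as a rough sketch. The helper name check_line_counts is hypothetical; it takes the output directory as its only argument and skips any of the three dedup files that does not exist.

```shell
# Sketch: compare the line count of merged_sort.txt with the summed line
# counts of the dedup outputs. Returns 0 when they match, non-zero otherwise.
check_line_counts() {
    dir=$1
    # $(( ... )) strips the padding some wc implementations print.
    sorted=$(($(wc -l < "$dir/merged_sort.txt")))
    parts=0
    for f in "$dir/merged_nodups.txt" "$dir/dups.txt" "$dir/opt_dups.txt"; do
        if [ -f "$f" ]; then
            n=$(($(wc -l < "$f")))
            parts=$((parts + n))
        fi
    done
    echo "merged_sort: $sorted lines; dedup outputs: $parts lines"
    [ "$sorted" -eq "$parts" ]
}
```

Counting lines takes longer than ls -l on files of this size, but it counts reads directly instead of bytes.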


DustinSokolowski commented 3 years ago

Hey!

Thank you for your prompt reply.

It looks like the combination of _dups.txt and _nodups.txt is close to the size of merged_sort (99.2% of the size).

Starting from the "align" directory:

ls -l *dups.txt > dups_size.txt
cat dups_size.txt | awk '{ print $5 }' | paste -sd+ - | bc

160217286868

ls -l merged_sort.txt

161559666042

For a given split, the dups and merged_nodups files are not the same size:

1993809 Mar 13 17:07 C68502_msplit0450_dups.txt

103420185 Mar 13 17:07 C68502_msplit0450_merged_nodups.txt

Lastly I've attached the first 1000 lines of merged_sort.txt to see if it's lacking the necessary readname.

If we do not need the opt files can we proceed without them?

Best! Dustin

Attachment: merged_sort1k.txt (https://github.com/aidenlab/juicer/files/6173474/merged_sort1k.txt)
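The 99.2% figure in the comment above can be reproduced from the two byte counts it quotes. This is only a worked check of the arithmetic; the variable names are illustrative.

```shell
# Byte counts quoted in the comment above.
merged_sort_bytes=161559666042
dedup_bytes=160217286868

# Bytes present in merged_sort.txt but not yet in the dedup outputs.
missing=$((merged_sort_bytes - dedup_bytes))

# Share of merged_sort.txt covered by the dedup outputs, to one decimal.
pct=$(LC_ALL=C awk -v a="$dedup_bytes" -v b="$merged_sort_bytes" \
    'BEGIN { printf "%.1f", 100 * a / b }')

echo "missing bytes: $missing; dedup outputs cover ${pct}% of merged_sort"
```

The gap is about 1.3 GB.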

nchernia commented 3 years ago

You don’t need to worry about opt dups.

Did you already cat to create merged_nodups? Here is what I’m suggesting you verify (set the outputdir var to aligned for example)

    # Check the sizes of merged_sort versus the dups/no dups files to be sure
    # no reads were lost
    total=1
    total2=0
    total=$(ls -l "${outputdir}/merged_sort.txt" | awk '{print $5}')
    total2=$(ls -l "${outputdir}/merged_nodups.txt" "${outputdir}/dups.txt" "${outputdir}/opt_dups.txt" | awk '{sum = sum + $5}END{print sum}')

    if [ -z "$total" ] || [ -z "$total2" ] || [ "$total" -ne "$total2" ]
    then
        echo "***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup"
        echo "Dups don't add up. Check ${outputdir} for results"
        exit 1
    fi


DustinSokolowski commented 3 years ago

Hey!

Thanks again. I ran your script and it looks like they do not add up.

The sum of dups.txt and merged_nodups.txt is 99% of the size of merged_sort.txt.

With that in mind, is it possible that the single split is working its way through really slowly for some reason? I tried restarting this a couple of times, and the same split slowed to a crawl each time.

Best, Dustin

nchernia commented 3 years ago

So I would stop that split (the slow one) and run the rmdups.awk script I sent, replacing the corresponding msplit_merged_nodups and msplit_dups files.

Alternatively, you could just run the rmdups.awk script directly on merged_sort. It might not take too long, depending on how big merged_sort is.
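Either option comes down to running the same exact-match pass over one sorted file. The sketch below wraps that logic in a hypothetical helper, dedup_file, which takes a sorted input file and an output prefix. The prefix plays the role of the name variable in rmdups.awk; the input file name for a single split is not shown in this thread, so substitute your own.

```shell
# Sketch: run the exact-match dedup pass on one sorted file.
# Usage (hypothetical helper): dedup_file <sorted_input> <output_prefix>
# Writes <output_prefix>merged_nodups.txt and <output_prefix>dups.txt.
dedup_file() {
    awk -v name="$2" '
    BEGIN {
        dupname = name "dups.txt"
        nodupname = name "merged_nodups.txt"
    }
    {
        # A line is a duplicate when its first eight fields equal
        # those of the previous line.
        if (NR > 1 && $1 == p1 && $2 == p2 && $3 == p3 && $4 == p4 &&
            $5 == p5 && $6 == p6 && $7 == p7 && $8 == p8)
            print > dupname
        else
            print > nodupname
        # Remember this line for the next comparison.
        p1 = $1; p2 = $2; p3 = $3; p4 = $4
        p5 = $5; p6 = $6; p7 = $7; p8 = $8
    }' "$1"
}
```

For the whole-file alternative, the input would be merged_sort.txt and the prefix would be the output directory followed by a slash, so that the results land in merged_nodups.txt and dups.txt there. If the input contains no duplicates, no dups file is created.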


DustinSokolowski commented 3 years ago

Thanks so much, I'll give it a shot!

DustinSokolowski commented 3 years ago

Hey

Thank you again! I think that it worked!

I have one last question, if that's alright. In the subsequent steps I got an error saying that there are no reads in the Hi-C contact matrices: "java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment."

Is this section required to make the .hic file for genome assembly? Furthermore, since I set -s "none", should this be run? Lastly, is there a way of checking whether the .hic file was properly made, or should I just continue to 3D-DNA and check the output there?

Thanks! Dustin

nchernia commented 3 years ago

If you’re using 3D DNA you can stop once the merged nodups is created.


aidenlab / juicer

PBS juicer no longer has a -j flag #214
