alekseyzimin / masurca

GNU General Public License v3.0

chromosome_scaffolder - turn off splitting - border-less-gap-filling #312

Open kullrich opened 1 year ago

kullrich commented 1 year ago

Dear @alekseyzimin,

I was wondering if it would be possible to add an option to completely turn off query scaffold splitting if one chooses to use the -nb option.

The splitting threshold was changed from 100 bp in version 4.0.9 to 10,000 bp in version 4.1.0.

Since I am starting with Bionano-scaffolded scaffolds that are to be re-oriented against a reference, the N-stretches in the query sequences encode at least a rough estimate of the distance between contigs, and splitting them would discard this information.

Also, these scaffolds should not be split during the re-orientation step. However, smaller contigs/scaffolds that the reference mapping positions inside an N-stretch should of course still be merged in as gap-fill (the N-borders would still exist, so it would in a way be gap-filling without borders).

Why Bionano hybrid-scaffolding failed to place a contig/scaffold in that region is another question, which does not need to be discussed here.

I wonder whether the scaffolding produced by your pipeline would still be valid in that case, and whether the merge step (border-less-gap-fill) would work.

I just tried to turn this step off in the bash script, but I wanted to check with you whether the merging is still theoretically valid.

if [ ! -e "$PREFIX.split.success" ]; then
  log "Skip: Splitting query scaffolds at >10000bp gaps"
  rm -f "$PREFIX.readalign.success"
  # Original splitting step, disabled:
  #$MYPATH/splitScaffoldsAtNs.sh $QRY 10000 > $HYB_CTG && \
  # Pass the query through unchanged so the N-stretches are preserved.
  cat "$QRY" > "$HYB_CTG" && \
  touch "$PREFIX.split.success"
fi

Thank you in anticipation

Best regards

Kristian

kullrich commented 1 year ago
border-less-gap-fill
alekseyzimin commented 1 year ago

Hi, you can turn off the step either by bypassing the command or by increasing the minimum gap size to something large, such as 10000000. Gap filling should be done with the close_scaffold_gaps script, which is a wrapper for the SAMBA scaffolder. The chromosome scaffolder treats scaffolds as blocks and is not allowed to place anything inside the gaps. That is one reason why I split on big gaps, as those gaps may be unreliable. Best, Aleksey
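
To illustrate why a very large minimum gap size has the same effect as bypassing the split, here is a minimal, self-contained sketch of "split at N-runs of at least a given length". It is only a stand-in for the idea and is not the implementation of splitScaffoldsAtNs.sh; the function name split_at_n_runs, the ".1/.2" piece naming, and the assumption that each FASTA record has its sequence on a single line are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: NOT the MaSuRCA splitScaffoldsAtNs.sh implementation.
# Reads FASTA on stdin (assumption: one sequence line per record) and
# splits each sequence at every N-run whose length is >= the given threshold.
# Shorter N-runs are kept inside the pieces unchanged.
split_at_n_runs() {
  local min_gap="$1"
  awk -v min_gap="$min_gap" '
    /^>/ { name = substr($1, 2); next }   # remember the record name
    {
      seq = $0; piece = 0; out = ""
      while (match(seq, /[Nn]+/)) {
        if (RLENGTH >= min_gap) {
          # Long gap: emit the piece collected so far and drop the Ns.
          out = out substr(seq, 1, RSTART - 1)
          if (length(out) > 0) printf(">%s.%d\n%s\n", name, ++piece, out)
          out = ""
        } else {
          # Short gap: keep the Ns as part of the current piece.
          out = out substr(seq, 1, RSTART + RLENGTH - 1)
        }
        seq = substr(seq, RSTART + RLENGTH)
      }
      out = out seq
      if (length(out) > 0) printf(">%s.%d\n%s\n", name, ++piece, out)
    }'
}
```

With a threshold of 5, the toy record ACGTNNNNNACGTNNACGT is cut into two pieces at the 5 bp gap, while the 2 bp gap stays in place. With a threshold of 10000000 no N-run qualifies, so the sequence comes out in one piece with all of its N-stretches intact.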

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com

kullrich commented 1 year ago

Thank you for the response.

I know the close_scaffold_gaps script; however, it will not help in this situation, since the reference should not be used for gap-filling and it relies on gap borders. The reference should only be used to find, place, and re-orient the query contigs/super-scaffolds. I take the super-scaffolds, including their N-stretch gaps and gap lengths, as reliable, since they are validated at the long-molecule level via Bionano.

For example, suppose query super-scaffold-1 is placed on the reference and both borders (left and right) of one of its N-stretches are found and placed. If a second contig or super-scaffold-2 is then placed between these borders using the reference positions, super-scaffold-2 should be merged into the N-stretch of super-scaffold-1, ideally at the same position relative to the left and right borders as detected for super-scaffold-1.
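
The relative placement described above could be sketched as simple arithmetic. This is only one possible reading of the idea, under stated assumptions: the helper place_in_gap and all of its parameters are hypothetical and not part of chromosome_scaffolder, the gap length is taken from the query N-stretch, and the inserted sequence is assumed to replace an equal number of Ns so that the total distance between the borders is preserved.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of chromosome_scaffolder.
# Arguments:
#   $1 gap_len   length of the N-stretch in super-scaffold-1 (query)
#   $2 ref_left  reference position where the left border ends
#   $3 ref_right reference position where the right border starts
#   $4 s2_start  reference position where super-scaffold-2 starts
#   $5 s2_len    length of super-scaffold-2
# Prints "<left_Ns> <right_Ns>": the Ns to keep on each side of the insert.
# Fails (exit status 1) if the insert is outside the borders or does not fit.
place_in_gap() {
  awk -v gap="$1" -v rl="$2" -v rr="$3" -v s="$4" -v len="$5" 'BEGIN {
    if (rr <= rl || s < rl || s > rr) {
      print "start is not between the borders" > "/dev/stderr"; exit 1
    }
    frac  = (s - rl) / (rr - rl)   # relative position on the reference
    left  = int(frac * gap)        # same relative offset inside the query gap
    right = gap - left - len       # remaining Ns keep the gap length constant
    if (right < 0) {
      print "insert does not fit into the gap" > "/dev/stderr"; exit 1
    }
    print left, right
  }'
}

# Example: a 10,000 bp gap whose borders map to 500000 and 520000 on the
# reference; super-scaffold-2 (3,000 bp) starts at 505000.
place_in_gap 10000 500000 520000 505000 3000   # prints "2500 4500"
```

Scaling the offset by the query gap length rather than the reference interval is a design choice that trusts the Bionano gap estimate over the reference distance; the opposite choice would be equally easy to express.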

Maybe you could point me to the step in the chromosome scaffolder script where I could try to implement this special use case?

Thank you in anticipation

Best regards

Kristian